ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
Paper • 2603.01620 • Published
A LoRA adapter trained with GRPO (Group Relative Policy Optimization) reinforcement learning on top of the SFT v2 model, for grounded image generation agent tasks.
| Parameter | Value |
|---|---|
| Base model | Qwen3.5-27B + SFT v2 LoRA |
| LoRA rank | 32 |
| Algorithm | GRPO (Search-R1 style) |
| Group size | 5 rollouts per prompt |
| Temperature | 1.0 |
| Training steps | 200 |
| Hardware | 4x H200 GPUs (device IDs 3, 4, 6, 7) |
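The initial GRPO run had zero reward variance within every group: all rollouts for a prompt received an identical reward, so the group-relative advantages, and therefore the gradients, were zero. This was fixed by switching from a binary reward to a continuous multiplicative reward with a random noise component. Result: frac_zero_std of 0.0, reward_std ~0.05, gradients flowing.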
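To illustrate why identical rewards stall training, here is a minimal sketch of group-relative advantages and of a continuous multiplicative reward. It is not the author's reward function: the component names (`format`, `tool_choice`, `grounding`), the noise scale, and the use of the population standard deviation are all assumptions made for illustration.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages: (r - mean) / (std + eps) over one prompt's rollouts.

    Uses the population standard deviation (an assumption for this sketch).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def multiplicative_reward(components, noise_scale=0.01, rng=random):
    """Product of component scores in [0, 1] plus small uniform noise, clipped to [0, 1].

    Component names and noise_scale are hypothetical, not the author's values.
    """
    product = 1.0
    for score in components.values():
        product *= score
    noisy = product + rng.uniform(-noise_scale, noise_scale)
    return min(1.0, max(0.0, noisy))


def frac_zero_std(groups, tol=1e-8):
    """Fraction of groups whose rewards are all (numerically) identical."""
    return sum(statistics.pstdev(g) < tol for g in groups) / len(groups)


# Binary reward: five rollouts that all succeed give zero advantages.
binary_group = [1.0] * 5
binary_adv = group_advantages(binary_group)

# Continuous reward: five rollouts with different partial scores.
rng = random.Random(0)
rollouts = [
    {"format": 1.0, "tool_choice": 0.9, "grounding": 0.7},
    {"format": 1.0, "tool_choice": 0.8, "grounding": 0.5},
    {"format": 1.0, "tool_choice": 1.0, "grounding": 0.9},
    {"format": 0.5, "tool_choice": 0.9, "grounding": 0.7},
    {"format": 1.0, "tool_choice": 0.6, "grounding": 0.6},
]
continuous_group = [multiplicative_reward(c, rng=rng) for c in rollouts]
continuous_adv = group_advantages(continuous_group)

print("binary advantages:    ", binary_adv)
print("continuous advantages:", [round(a, 3) for a in continuous_adv])
print("frac_zero_std:", frac_zero_std([binary_group, continuous_group]))
```

With the binary group every advantage is zero, so the policy-gradient term vanishes. The continuous group spreads rewards across rollouts, which gives the optimizer a ranking to learn from.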
Tool output tokens are excluded from the policy gradient, so the RL signal only updates the model's reasoning and tool-selection decisions.
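The masking idea can be sketched as follows, assuming each trajectory is split into segments tagged by who produced them. The `build_mask` and `masked_policy_loss` helpers and the `"model"`/`"tool"` labels are hypothetical names for this illustration, not the training code used here.

```python
def build_mask(segments):
    """Expand (source, n_tokens) segments into a per-token boolean mask.

    True marks tokens generated by the model; False marks tool output.
    """
    mask = []
    for source, n_tokens in segments:
        mask.extend([source == "model"] * n_tokens)
    return mask


def masked_policy_loss(token_losses, is_model_token):
    """Average the per-token loss over model-generated tokens only."""
    if len(token_losses) != len(is_model_token):
        raise ValueError("token_losses and mask must have the same length")
    kept = [loss for loss, keep in zip(token_losses, is_model_token) if keep]
    if not kept:
        return 0.0  # nothing to learn from if the model produced no tokens
    return sum(kept) / len(kept)


# Reasoning (3 tokens), tool result (4 tokens), final answer (2 tokens).
segments = [("model", 3), ("tool", 4), ("model", 2)]
mask = build_mask(segments)

# The large values on tool tokens must not influence the loss.
token_losses = [0.5, 0.25, 0.75, 9.0, 9.0, 9.0, 9.0, 1.0, 0.5]
loss = masked_policy_loss(token_losses, mask)
print(mask)
print(loss)
```

Because tool output is text the policy did not choose, leaving it in the loss would reward or penalize the model for tokens it cannot control.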
| Step | Reward Mean | Reward Std | frac_zero_std | Loss |
|---|---|---|---|---|
| 1 | 0.575 | 0.062 | 0.0 | -0.012 |
| 50 | 0.569 | 0.054 | 0.0 | -0.005 |
| 100 | 0.578 | 0.077 | 0.0 | -0.005 |
| 150 | 0.576 | 0.045 | 0.0 | -0.006 |
| 200 | 0.583 | 0.039 | 0.0 | -0.004 |
| Test | Base | SFT v2 | RL v2 |
|---|---|---|---|
| Real location (NTU Hive) | 0.70 | 0.85 | 0.70 |
| Multi-step (Marina Bay Sands) | 0.70 | 1.00 | 0.85 |
| Database search | 1.00 | 1.00 | 1.00 |
| Fictional scene | 1.00 | 1.00 | 1.00 |
| No tools needed | 0.75 | 0.75 | 0.75 |
| Full pipeline | 0.70 | 0.85 | 0.85 |
| Average | 0.81 | 0.91 | 0.86 |
| Comparison | Delta (absolute change in average score) |
|---|---|
| SFT v2 vs Base | +0.10 |
| RL vs Base | +0.05 |
| RL vs SFT v2 | -0.05 |
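As a quick check, the averages and deltas in the two tables above can be re-derived from the per-test scores. The sketch below copies the scores from the evaluation table and treats each delta as an absolute difference between rounded averages, which is an assumption about how the figures were computed.

```python
# Per-test scores copied from the evaluation table: (Base, SFT v2, RL v2).
scores = {
    "Real location (NTU Hive)": (0.70, 0.85, 0.70),
    "Multi-step (Marina Bay Sands)": (0.70, 1.00, 0.85),
    "Database search": (1.00, 1.00, 1.00),
    "Fictional scene": (1.00, 1.00, 1.00),
    "No tools needed": (0.75, 0.75, 0.75),
    "Full pipeline": (0.70, 0.85, 0.85),
}
columns = ("Base", "SFT v2", "RL v2")

# Mean score per model, rounded to two decimals as in the table.
averages = {
    name: round(sum(row[i] for row in scores.values()) / len(scores), 2)
    for i, name in enumerate(columns)
}

# Absolute differences between the rounded averages.
deltas = {
    "SFT v2 vs Base": round(averages["SFT v2"] - averages["Base"], 2),
    "RL vs Base": round(averages["RL v2"] - averages["Base"], 2),
    "RL vs SFT v2": round(averages["RL v2"] - averages["SFT v2"], 2),
}

print(averages)
print(deltas)
```

The differences are in score units on a 0 to 1 scale, so a delta of 0.10 corresponds to ten percentage points rather than a 10% relative change.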
Base model: Qwen/Qwen3.5-27B