Qwen3.5-27B Agent GRPO v2 — Reinforcement Learning LoRA

A LoRA adapter trained with GRPO (Group Relative Policy Optimization) reinforcement learning on top of the SFT v2 model, for grounded image generation agent tasks.

Model Details

| Parameter | Value |
|---|---|
| Base model | Qwen3.5-27B + SFT v2 LoRA |
| LoRA rank | 32 |
| Algorithm | GRPO (Search-R1 style) |
| Group size | 5 rollouts per prompt |
| Temperature | 1.0 |
| Training steps | 200 |
| Hardware | 4x H200 GPUs (3,4,6,7) |

Reward Function (ToolRLA-Inspired Multiplicative Decomposition)

Key Technical Decisions

Zero-Variance Fix

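The initial GRPO run had zero reward variance within each group: all rollouts for a prompt received an identical reward, so the group-relative advantages, and therefore the gradients, were zero. This was fixed by switching from a binary reward to a continuous multiplicative reward with a random noise component. Result: frac_zero_std of 0.0 at every logged step, reward_std ~0.05, and gradients flowing.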
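
To illustrate why identical rewards stall GRPO, here is a minimal sketch of group-relative advantage normalization. The `continuous_reward` helper, its factor values, the additive Gaussian noise and the `eps` and `noise_scale` defaults are hypothetical stand-ins, not the reward function used for this adapter.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - mean) / (std + eps) over one prompt's rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def continuous_reward(factors, rng, noise_scale=0.01):
    """Hypothetical reward: product of per-aspect scores in [0, 1] plus small Gaussian noise."""
    return float(np.prod(factors) + rng.normal(0.0, noise_scale))

# Binary reward where all five rollouts fail: every advantage is zero, so no gradient.
adv_binary = group_advantages([0.0] * 5)

# Continuous reward: rollouts that differ slightly in quality get distinct advantages.
rng = np.random.default_rng(0)
rollout_factors = [  # assumed (made-up) per-aspect scores for five rollouts
    (1.0, 0.8, 0.7),
    (1.0, 0.9, 0.7),
    (1.0, 0.8, 0.6),
    (0.9, 0.8, 0.7),
    (1.0, 0.7, 0.8),
]
rewards = [continuous_reward(f, rng) for f in rollout_factors]
adv_continuous = group_advantages(rewards)
```

With the binary rewards every entry of `adv_binary` is zero, while `adv_continuous` contains both positive and negative values that the policy gradient can use.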

DAPO Improvements Applied

  • Dynamic Sampling: Filter zero-variance groups
  • Clip-Higher: Asymmetric clipping for exploration
  • Token-level PG loss: Better for long ReAct traces
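
The three DAPO techniques above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, and the clipping bounds `eps_low=0.2` and `eps_high=0.28` are the defaults from the DAPO paper, which may differ from the values used in this run.

```python
import numpy as np

def filter_zero_variance_groups(group_rewards, tol=1e-8):
    """Dynamic sampling: return indices of groups whose rewards are not all equal."""
    return [i for i, r in enumerate(group_rewards) if np.std(r) > tol]

def clip_higher_token_loss(logp_new, logp_old, advantages, mask,
                           eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with asymmetric bounds, averaged over tokens.

    All arrays share the shape (batch, seq_len). The loss is normalized by the
    total number of unmasked tokens in the batch (token-level), rather than
    averaging per sequence first.
    """
    ratio = np.exp(logp_new - logp_old)
    # Clip-Higher: the upper bound is looser than the lower bound.
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -np.minimum(ratio * advantages, clipped * advantages)
    return float((per_token * mask).sum() / max(mask.sum(), 1.0))

# Group 0 has identical rewards and is dropped; group 1 is kept.
kept = filter_zero_variance_groups([[0.5] * 5, [0.4, 0.6, 0.5, 0.55, 0.58]])

# One token whose probability ratio is 1.5 with a positive advantage:
# the ratio is clipped at 1 + eps_high = 1.28.
loss = clip_higher_token_loss(
    logp_new=np.log(np.array([[1.5]])),
    logp_old=np.zeros((1, 1)),
    advantages=np.ones((1, 1)),
    mask=np.ones((1, 1)),
)
```

Token-level normalization gives every token in a long ReAct trace the same weight, so long traces are not down-weighted relative to short ones.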

Retrieved Token Masking (Search-R1)

Tool output tokens are excluded from the policy gradient, so the RL signal only updates the reasoning and tool-selection tokens that the model itself generates.
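
A minimal sketch of such a loss mask, assuming each token in a trace is already labeled with its source. The `"model"` and `"tool"` labels and both function names are hypothetical; the actual pipeline may identify tool output differently, for example by delimiter tags.

```python
import numpy as np

def build_loss_mask(token_sources):
    """Return 1.0 for model-generated tokens and 0.0 for tool-output tokens."""
    return np.array([1.0 if s == "model" else 0.0 for s in token_sources])

def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    per_token_loss = np.asarray(per_token_loss, dtype=np.float64)
    return float((per_token_loss * mask).sum() / max(mask.sum(), 1.0))

# A five-token trace: two reasoning tokens, two retrieved tokens, one more reasoning token.
sources = ["model", "model", "tool", "tool", "model"]
mask = build_loss_mask(sources)

# The large losses on the tool tokens do not contribute to the result.
loss = masked_mean([1.0, 2.0, 100.0, 100.0, 3.0], mask)
```

Here only the three model-generated tokens count, so the masked loss is the mean of 1.0, 2.0 and 3.0.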

Training Metrics

| Step | Reward Mean | Reward Std | frac_zero_std | Loss |
|---|---|---|---|---|
| 1 | 0.575 | 0.062 | 0.0 | -0.012 |
| 50 | 0.569 | 0.054 | 0.0 | -0.005 |
| 100 | 0.578 | 0.077 | 0.0 | -0.005 |
| 150 | 0.576 | 0.045 | 0.0 | -0.006 |
| 200 | 0.583 | 0.039 | 0.0 | -0.004 |
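
For readers unfamiliar with these columns, the sketch below shows one plausible way to compute them from a batch of rollout rewards. It assumes that Reward Std is the mean within-group standard deviation and that frac_zero_std is the fraction of groups whose rewards are all identical; the training framework's exact definitions may differ, and `batch_reward_stats` is a hypothetical name.

```python
import numpy as np

def batch_reward_stats(group_rewards, tol=1e-8):
    """Summarize rewards for a batch shaped (num_prompts, group_size)."""
    r = np.asarray(group_rewards, dtype=np.float64)
    group_std = r.std(axis=1)  # one standard deviation per prompt group
    return {
        "reward_mean": float(r.mean()),
        "reward_std": float(group_std.mean()),
        "frac_zero_std": float((group_std <= tol).mean()),
    }

# Two groups of five rollouts: the first has identical rewards, the second does not.
stats = batch_reward_stats([
    [0.5, 0.5, 0.5, 0.5, 0.5],
    [0.4, 0.6, 0.5, 0.5, 0.5],
])
```

In this toy batch half of the groups carry no learning signal, so `frac_zero_std` is 0.5; the value of 0.0 in the table means every group contributed gradients.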

Evaluation (3-Way: Base vs SFT v2 vs RL v2)

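| Test | Base | SFT v2 | RL v2 |
|---|---|---|---|
| Real location (NTU Hive) | 0.70 | 0.85 | 0.70 |
| Multi-step (Marina Bay Sands) | 0.70 | 1.00 | 0.85 |
| Database search | 1.00 | 1.00 | 1.00 |
| Fictional scene | 1.00 | 1.00 | 1.00 |
| No tools needed | 0.75 | 0.75 | 0.75 |
| Full pipeline | 0.70 | 0.85 | 0.85 |
| Average | 0.81 | 0.91 | 0.86 |

| Comparison | Delta in average score |
|---|---|
| SFT v2 vs Base | +0.10 (+10 points) |
| RL v2 vs Base | +0.05 (+5 points) |
| RL v2 vs SFT v2 | -0.05 (-5 points) |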
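
The averages and deltas above can be reproduced from the per-test scores. The short check below uses only the numbers in the evaluation table; the deltas are absolute differences between average scores (percentage points), not relative improvements.

```python
# Per-test scores in table order: real location, multi-step, database search,
# fictional scene, no tools needed, full pipeline.
scores = {
    "Base":   [0.70, 0.70, 1.00, 1.00, 0.75, 0.70],
    "SFT v2": [0.85, 1.00, 1.00, 1.00, 0.75, 0.85],
    "RL v2":  [0.70, 0.85, 1.00, 1.00, 0.75, 0.85],
}

# Unweighted mean over the six tests.
averages = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Absolute differences between the average scores.
deltas = {
    "SFT v2 vs Base": averages["SFT v2"] - averages["Base"],
    "RL v2 vs Base": averages["RL v2"] - averages["Base"],
    "RL v2 vs SFT v2": averages["RL v2"] - averages["SFT v2"],
}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```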

Usage

References

  • Search-R1 - Retrieved token masking
  • DAPO - Dynamic sampling, Clip-Higher
  • ToolRL - Fine-grained tool reward design
  • ToolRLA - Multiplicative reward decomposition
  • OpenClaw-RL - Binary RL + OPD

Infrastructure

  • 4x NVIDIA H200 (144GB VRAM each)
  • Part of SC4062 Grounded Image Generation Agent project

Model tree for Rabornkraken/qwen3.5-27b-agent-grpo-v2

Base model: Qwen/Qwen3.5-27B
Adapter: this model
