Qwen3.5-27B Agent GRPO v2 — Reinforcement Learning LoRA

A LoRA adapter trained with GRPO (Group Relative Policy Optimization) reinforcement learning on top of the SFT v2 model, for grounded image generation agent tasks.

Model Details

Parameter	Value
Base model	Qwen3.5-27B + SFT v2 LoRA
LoRA rank	32
Algorithm	GRPO (Search-R1 style)
Group size	5 rollouts per prompt
Temperature	1.0
Training steps	200
Hardware	4x H200 GPUs (device indices 3, 4, 6, 7)

Reward Function (ToolRLA-Inspired Multiplicative Decomposition)

Key Technical Decisions

Zero-Variance Fix

In the initial GRPO run, all rollouts in a group received identical rewards, so the group-relative advantages were zero and no gradient flowed. The fix was to switch from a binary reward to a continuous multiplicative reward with a random noise component. Result: reward_std ~0.05, gradients flowing.
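The zero-variance problem can be reproduced in a few lines. The sketch below is illustrative and is not the training code: `group_advantages` is a hypothetical helper, the use of the population standard deviation is an assumption, and the continuous reward values are made up.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Binary reward: all five rollouts succeed, so every advantage is exactly zero
# and the policy-gradient term vanishes for this prompt.
binary_group = [1.0, 1.0, 1.0, 1.0, 1.0]
binary_adv = group_advantages(binary_group)

# Continuous reward (made-up values): rollouts now differ, so they get ranked
# and each one receives a nonzero advantage.
continuous_group = [0.52, 0.61, 0.58, 0.49, 0.66]
continuous_adv = group_advantages(continuous_group)

print(binary_adv)
print(continuous_adv)
```

With a group size of 5, a single prompt where every rollout scores the same contributes nothing to the update, which is why a graded reward matters more than it would with larger groups.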

DAPO Improvements Applied

Dynamic Sampling: Filter zero-variance groups
Clip-Higher: Asymmetric clipping for exploration
Token-level PG loss: Better for long ReAct traces
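The three changes listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training implementation: all function names are hypothetical, and the clipping bounds (0.2 and 0.28) are the defaults reported in the DAPO paper, not necessarily the values used in this run.

```python
def keep_group(rewards, tol=1e-8):
    """Dynamic sampling: keep a group only if its rewards are not all identical."""
    return max(rewards) - min(rewards) > tol

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with a wider upper bound than lower bound (Clip-Higher)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def token_level_loss(per_token_objectives):
    """Average over every token in the batch instead of averaging per sequence first."""
    flat = [obj for seq in per_token_objectives for obj in seq]
    return -sum(flat) / len(flat)

# A zero-variance group is dropped; a group with any spread is kept.
print(keep_group([1.0, 1.0, 1.0, 1.0, 1.0]))
print(keep_group([0.5, 0.6, 0.5, 0.5, 0.5]))

# A ratio of 1.25 with positive advantage would be clipped at 1.2 under
# symmetric clipping, but passes through unchanged here.
print(clip_higher_objective(1.25, 1.0))

# A short trace and a long trace: each token carries equal weight.
print(token_level_loss([[1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]))
```

Token-level averaging means a long ReAct trace contributes in proportion to its length rather than being down-weighted to the same total as a short one.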

Retrieved Token Masking (Search-R1)

Tool output tokens are excluded from the policy-gradient loss, so the RL signal only updates the tokens the model itself generates: reasoning and tool-selection decisions.
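A minimal sketch of the masking idea, assuming per-token objectives and a boolean mask marking tool-output tokens are already available. `masked_pg_loss` and the example values are hypothetical and only illustrate the mechanism.

```python
def masked_pg_loss(token_objectives, is_tool_output):
    """Average the objective over policy-generated tokens only."""
    kept = [obj for obj, masked in zip(token_objectives, is_tool_output) if not masked]
    if not kept:
        return 0.0  # a trace with no policy tokens contributes nothing
    return -sum(kept) / len(kept)

# Hypothetical trace: 2 reasoning tokens, 3 tool-output tokens, 1 answer token.
# The large values on the tool-output positions are ignored by the mask.
objectives = [0.4, 0.2, 9.0, 9.0, 9.0, 0.3]
tool_mask = [False, False, True, True, True, False]
loss = masked_pg_loss(objectives, tool_mask)
print(loss)
```

Without the mask, the model would be pushed to raise or lower the likelihood of text it did not write, which carries no information about its own decisions.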

Training Metrics

Step	Reward Mean	Reward Std	Loss
1	0.575	0.062	-0.012
50	0.569	0.054	-0.005
100	0.578	0.077	-0.005
150	0.576	0.045	-0.006
200	0.583	0.039	-0.004

Evaluation (3-Way: Base vs SFT v2 vs RL v2)

Test	Base	SFT v2	RL v2
Real location (NTU Hive)	0.70	0.85	0.70
Multi-step (Marina Bay Sands)	0.70	1.00	0.85
Database search	1.00	1.00	1.00
Fictional scene	1.00	1.00	1.00
No tools needed	0.75	0.75	0.75
Full pipeline	0.70	0.85	0.85
Average	0.81	0.91	0.86