Anchor: Bilateral Negotiation via RLVR

Replication and extension of "Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards" (paper 2604.09855).

Goal: train a small open model (Qwen3-4B/8B family) to negotiate effectively under incomplete information on a tight compute budget.

Active scripts

File	Purpose
`train_negotiation_pure.py`	Current pure Negotiation-RLVR training script. Buyer-only GRPO against a frozen regulated seller, starting from the optimized SPIRAL/RLVR infrastructure but with all self-play/RAE/seller-training pieces removed.
`train_negotiation_sdpo.py`	SDPO+GRPO buyer-only experiment. Forks the pure script, defaults to Qwen3-8B, and adds strict feedback-conditioned self-teacher credit assignment while preserving the frozen regulated seller setup.
`train_negotiation_dual_role.py`	SPIRAL/RLVR hybrid self-play script. Shared policy plays buyer + seller with RAE. Kept for comparison and future dual-role experiments.
`eval_negotiation.py`	Evaluation script for trained checkpoints against a frozen seller under the paper-style protocol.
`JOURNAL.md`	Engineering log: papers, bugs, runs, hyperparameters, job IDs, and rationale.

train_negotiation_clean.py was the old pure buyer-only prototype and has been deleted. It predated Thought stripping, batched generation, monitoring, checkpoint branches, and later memory/stability fixes.

Pure Negotiation-RLVR setup

train_negotiation_pure.py follows the negotiation paper architecture:

Trainable buyer policy.
Frozen seller model as the environment counterparty.
Buyer starts every episode; only buyer turns receive policy-gradient updates.
Seller is regulated: it cannot accept or propose a price below its private cost.
Opponent-visible dialogue strips Thought: blocks, preserving the paper's hidden-scratchpad / incomplete-information assumption.
Reward: R = (budget - P_final) / |budget - cost|, clipped to [-1, 1]; no-deal/quit/timeout = 0; buyer format/budget/protocol errors = -1.
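
The reward rule above can be written as a small function. This is a minimal sketch, not the code in train_negotiation_pure.py: the function name, its arguments, and the handling of a zero budget/cost spread are assumptions made for illustration.

```python
from typing import Optional


def buyer_reward(budget: float, cost: float,
                 final_price: Optional[float],
                 buyer_error: bool = False) -> float:
    """Buyer reward as described above (hypothetical helper)."""
    # Buyer format/budget/protocol errors are penalized with -1.
    if buyer_error:
        return -1.0
    # No-deal, quit, and timeout episodes carry no final price and score 0.
    if final_price is None:
        return 0.0
    spread = abs(budget - cost)
    # Assumption: a degenerate zero spread is treated as reward 0.
    if spread == 0:
        return 0.0
    raw = (budget - final_price) / spread
    # Clip to [-1, 1].
    return max(-1.0, min(1.0, raw))
```

With a budget of 100 and a seller cost of 60, a deal at 80 captures half of the available surplus and scores 0.5, while a deal at the seller's cost scores 1.0.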

Non-conflicting improvements retained from the SPIRAL/RLVR script:

Batched turn-parallel generation.
Runtime private-info leak guards.
Memory-efficient token log-prob computation.
Clamped log-ratio for GRPO stability.
Group-level advantage normalization.
Optional Liger kernels.
W&B logging + alerts.
Periodic HF Hub branch checkpoints.
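
Two of the retained items, group-level advantage normalization and the clamped log-ratio, can be sketched in plain Python. The function names, the epsilon, and the clamp bound of 10.0 are illustrative assumptions; the training script's actual values and tensor code may differ.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one product's rollout group to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout got the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def clamped_ratio(logp_new, logp_old, max_abs_log_ratio=10.0):
    """Importance ratio exp(logp_new - logp_old) with the log-ratio clamped first."""
    # Assumption: a symmetric clamp bound; it stops exp() from overflowing
    # when the two log-probabilities are far apart.
    log_ratio = max(-max_abs_log_ratio, min(max_abs_log_ratio, logp_new - logp_old))
    return math.exp(log_ratio)
```

A group in which every rollout earns the same reward yields all-zero advantages, so it contributes no gradient signal.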

Explicitly removed from the SPIRAL setup:

Shared-policy self-play.
Seller training.
Zero-sum seller reward.
RAE (role-conditioned advantage estimation) / per-role baselines.
Dual-role ratio.

Default initial pure run

The 42-iteration pure run has completed and was pushed to ZeterMordio/anchor-negotiation-pure. Its launch defaults were:

Setting	Default
Model	`Qwen/Qwen3-4B-Instruct-2507`
Iterations	`42`
Batch size	`16` products
Group size	`8` rollouts/product
Episodes / iter	`128`
Max turns	`6`
Max tokens / response	`300`
LR	`1e-6`
KL anchor	`0.01` (pure script only; SDPO is now ref-free)
Buyer temp	`1.0`
Seller temp	`0.7`
Checkpoint every	`10` iters
Recommended hardware	`a100-large`

Local syntax check

python3 -m py_compile train_negotiation_pure.py train_negotiation_sdpo.py train_negotiation_dual_role.py eval_negotiation.py

eval_negotiation.py uses the same 18-category AmazonHistoryPrice loader and the same 802/128 seed split as the training scripts. Override TRAIN_SPLIT_SIZE / TEST_SPLIT_SIZE only for controlled split-parity experiments.
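
One way to get a reproducible 802/128 split that both training and evaluation can share is a seeded shuffle. The sketch below is an assumption about the mechanism, not the loader's actual code: seeded_split, its seed argument, and the shuffle-then-slice strategy are hypothetical; only the two environment variable names and the default sizes come from the text above.

```python
import os
import random


def seeded_split(n_items, seed=0, train_size=None, test_size=None):
    """Return disjoint (train, test) index lists from a seeded shuffle."""
    # Environment overrides are read only when sizes are not passed explicitly.
    if train_size is None:
        train_size = int(os.environ.get("TRAIN_SPLIT_SIZE", 802))
    if test_size is None:
        test_size = int(os.environ.get("TEST_SPLIT_SIZE", 128))
    if train_size + test_size > n_items:
        raise ValueError("split sizes exceed the number of items")
    order = list(range(n_items))
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(order)
    return order[:train_size], order[train_size:train_size + test_size]
```

Because the shuffle depends only on the seed and the item count, calling it with the same arguments in the training and evaluation scripts yields identical index lists.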

SDPO+GRPO 8B run

train_negotiation_sdpo.py is the intended first self-distillation experiment. It keeps the pure buyer-only environment but uses hindsight verifier feedback and same-product on-policy rollout demos to compute dense SDPO token advantages. Current real-run defaults are deliberately bolder than the initial canaries: SDPO_LAMBDA=0.5 gives equal weight to the GRPO scalar reward and SDPO token shaping; SDPO_FEEDBACK_MODE=strict avoids exact seller-cost leakage.
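
To make the "equal weight" reading of SDPO_LAMBDA=0.5 concrete, the sketch below mixes a per-rollout GRPO advantage with per-token SDPO advantages as a convex combination. The combination rule, the function name, and applying the clip to the SDPO term only are assumptions for illustration; the script may combine the two signals differently.

```python
def mixed_token_advantages(grpo_adv, sdpo_token_advs, lam=0.5, adv_clip=5.0):
    """Blend one scalar GRPO advantage with dense per-token SDPO advantages."""
    mixed = []
    for a_tok in sdpo_token_advs:
        # Assumption: SDPO_ADV_CLIP bounds the dense term before mixing.
        clipped = max(-adv_clip, min(adv_clip, a_tok))
        # lam=0.5 weights the scalar reward signal and the token shaping equally.
        mixed.append((1.0 - lam) * grpo_adv + lam * clipped)
    return mixed
```

Under these assumptions, lam=0.0 recovers plain GRPO (every token shares the rollout's advantage) and lam=1.0 uses only the token-level shaping.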

Earlier smoke validation completed on A100 with Qwen3-4B-Instruct-2507 and full format settings (MAX_TURNS=6, MAX_NEW_TOKENS=300): job 6a05a28a3308d79117b8f560, model repo ZeterMordio/anchor-negotiation-sdpo-smoke-fullfmt. It completed 1 iteration and pushed metrics/model.

Qwen3.5 reasoning ablations are selected with REASONING_MODE:

option_a (default) disables native chat-template thinking and keeps the explicit private Thought: field, while hard-canonicalizing public text to Talk: plus the first Action: line.
option_b enables native Qwen thinking (enable_thinking=True in chat-template calls) and switches the role prompts to a native protocol: the model reasons inside private <think>...</think>, then outputs only public Talk: + Action:.

NATIVE_PUBLIC_FINALIZER=1 (default for option_b) lets Qwen spend NATIVE_THINK_TOKENS on hidden reasoning, forcibly closes the private block if needed, and runs a short NATIVE_FINAL_TOKENS same-assistant continuation prefilled only with Talk: for the final public Talk/Action, so longer reasoning budgets do not become format errors. Generated native-think blocks are kept in same-role private assistant history and stripped before opponent visibility/action parsing.

The script logs native/explicit reasoning marker counts per iteration and can print raw/public buyer samples with DEBUG_SAMPLE_BUYER_OUTPUTS. The Qwen3.5 parser now handles trailing action punctuation such as $25.00., and the privacy guard still fails on structured buyer Shopping List/budget_limit leakage while allowing harmless public mentions of the phrase “shopping list”.
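
The stripping and parsing behavior described above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the script's parser: public_view and parse_price are hypothetical names, the "offer" action wording is invented for the example, and Talk: is assumed to fit on one line.

```python
import re

# Native private reasoning block, possibly spanning several lines.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# Dollar amount with optional thousands separators and decimals.
PRICE_RE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d+)?)")


def public_view(raw):
    """Keep only the first Talk: line and the first Action: line."""
    text = THINK_RE.sub("", raw)
    talk = action = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Talk:") and talk is None:
            talk = stripped
        elif stripped.startswith("Action:") and action is None:
            action = stripped
    # Thought: lines and any later Action: lines are dropped.
    return "\n".join(part for part in (talk, action) if part)


def parse_price(action_line):
    """Extract the first dollar amount, ignoring trailing punctuation."""
    match = PRICE_RE.search(action_line)
    return float(match.group(1).replace(",", "")) if match else None
```

For example, an output containing a <think> block, a Thought: line, one Talk: line, and two Action: lines is reduced to the Talk: line plus the first Action: line, and an action ending in "$25.00." parses as 25.0 because the final period is not part of the matched number.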

The SDPO update path is optimized for the 8B run: buyer turns are flattened into pre-tokenized microbatch examples, teacher/student completion masks are aligned row-wise, and CPU-state AdamW steps once per production-shape iteration by default (UPDATE_MICROBATCH_SIZE=4, OPTIM_STEP_EVERY_GROUPS=16).

It is now a true ref-free/on-policy objective (KL_COEF=0.0 by default): the update does not load a frozen reference-policy model or run a reference forward pass; it uses the sampled-token policy-gradient loss -A * logπ over buyer completion tokens, while the SDPO self-teacher remains the current buyer model under hindsight feedback.

Current serious-run optimizer defaults are NUM_ITERS=60, LR=5e-6 with WARMUP_STEPS=10, WEIGHT_DECAY=0.01, GRAD_CLIP_NORM=1.0, and ROLLOUT_MAX_LENGTH=UPDATE_MAX_LENGTH=3072.

W&B logs per-phase update timers under perf/update_*_s plus perf/update_examples, perf/optimizer_steps, train/optimizer_global_step, train/lr, and train/grad_norm_last; perf/update_ref_forward_s is retained as a zero-valued compatibility metric.

Smoke runs:

Ref-free smoke: HF job 6a0a6998e7940de6ee6cdfa3, W&B run itnu5od5, model/metrics repo ZeterMordio/anchor-negotiation-sdpo-ref-free-smoke.
Previous pre-ref-free GPU smoke: HF job 6a0a5fd2a5e509f1a8413e2b, W&B run fnscexas, model/metrics repo ZeterMordio/anchor-negotiation-sdpo-perf-smoke.
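
The sampled-token loss -A * logπ and the warmup setting can be sketched without any tensor library. Both helpers are hypothetical: averaging over completion tokens, the linear warmup shape, and holding the rate constant after warmup are assumptions, since the text above only fixes the per-token term, LR=5e-6, and WARMUP_STEPS=10.

```python
def masked_pg_loss(token_logps, token_advs, completion_mask):
    """Mean of -A * log pi over tokens where completion_mask is truthy."""
    total, count = 0.0, 0
    for logp, adv, keep in zip(token_logps, token_advs, completion_mask):
        if keep:  # prompt and padding positions are skipped
            total += -adv * logp
            count += 1
    # Assumption: normalize by the number of buyer completion tokens.
    return total / max(count, 1)


def warmup_lr(step, base_lr=5e-6, warmup_steps=10):
    """Linear warmup to base_lr, then constant (assumed schedule shape)."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```

No reference-policy term appears in the loss, which matches KL_COEF=0.0: only the current policy's log-probabilities of its own sampled tokens are needed.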

hf jobs uv run \
  --flavor a100-large \
  --timeout 12h \
  --with torch==2.6.0 \
  --with transformers \
  --with accelerate \
  --with huggingface_hub \
  --with wandb \
  --with liger-kernel \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  --env MODEL_NAME=Qwen/Qwen3-8B \
  --env SELLER_MODEL_NAME=Qwen/Qwen3-8B \
  --env NUM_ITERS=60 \
  --env BATCH_SIZE=16 \
  --env GROUP_SIZE=8 \
  --env MAX_TURNS=6 \
  --env MAX_NEW_TOKENS=300 \
  --env LR=5e-6 \
  --env WEIGHT_DECAY=0.01 \
  --env WARMUP_STEPS=10 \
  --env GRAD_CLIP_NORM=1.0 \
  --env KL_COEF=0.0 \
  --env SDPO_LAMBDA=0.5 \
  --env SDPO_FEEDBACK_MODE=strict \
  --env SDPO_ADV_CLIP=5.0 \
  --env BUYER_TEMP=1.0 \
  --env SELLER_TEMP=0.7 \
  --env CHECKPOINT_EVERY=10 \
  --env GEN_BATCH_LIMIT=128 \
  --env UPDATE_MICROBATCH_SIZE=4 \
  --env OPTIM_STEP_EVERY_GROUPS=16 \
  --env UPDATE_PAD_TO_MULTIPLE_OF=8 \
  --env ROLLOUT_MAX_LENGTH=3072 \
  --env UPDATE_MAX_LENGTH=3072 \
  --env GRADIENT_CHECKPOINTING=1 \
  --env HUB_MODEL_ID=ZeterMordio/anchor-negotiation-sdpo \
  --env WANDB_ENTITY=chalk \
  --env WANDB_PROJECT=anchor-negotiation-sdpo \
  --env PYTHONUNBUFFERED=1 \
  --detach \
  train_negotiation_sdpo.py

W&B monitoring

The scripts now use Weights & Biases instead of Trackio.

WANDB_ENTITY is the W&B account or team namespace that owns the run. For the available API key, the default entity is chalk.
WANDB_PROJECT is the project bucket/group inside that entity. Projects are created automatically on first run; no web-page setup is required.
RUN_NAME is the human-readable run name. Leave unset for the standardized auto-name; set it only for one-off manual labels.
WANDB_GROUP groups related reruns/ablations. Leave unset for the standardized auto-group.
WANDB_MODE=online logs to the W&B web app; use WANDB_MODE=offline for local dry runs.
HF Jobs must receive --secrets WANDB_API_KEY in addition to --secrets HF_TOKEN.
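
As a rough illustration of how these variables map onto a W&B run, the sketch below collects them into keyword arguments that wandb.init accepts (entity, project, name, group, mode). The helper name and the fallback values are assumptions; in particular, the standardized auto-name and auto-group logic of the scripts is not reproduced here, so unset values are simply passed as None.

```python
import os


def wandb_init_kwargs(env=None):
    """Build wandb.init keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        "entity": env.get("WANDB_ENTITY", "chalk"),
        "project": env.get("WANDB_PROJECT"),
        # Empty or unset means "let the script pick its standardized name/group".
        "name": env.get("RUN_NAME") or None,
        "group": env.get("WANDB_GROUP") or None,
        "mode": env.get("WANDB_MODE", "online"),
    }


# Usage (requires the wandb package and WANDB_API_KEY for online mode):
#   import wandb
#   run = wandb.init(**wandb_init_kwargs(), config={"lr": 5e-6})
```

Passing a plain dict instead of os.environ makes the mapping easy to check in a local dry run before launching a job.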

Best practice for this repo: keep run names short enough to scan, but put full detail in wandb.config.

Run-name schema:

Script	Default name shape
Pure	`pure__<model>__i<iters>_b<batch>xg<group>__lr<lr>_kl<kl>__s<seed>`
SDPO/SDRO	`sdpo__<model>__l<lambda>__<distill>__i<iters>_b<batch>xg<group>__fb<mode>_clip<clip>__lr<lr>_kl<kl>__s<seed>`
Dual-role	`dual__<model>__i<iters>_b<batch>xg<group>__dr<ratio>_rae<decay>_<ref>_<advnorm>__lr<lr>_kl<kl>`

For SDPO/SDRO, <distill> is currently tokgap. Future logit-level variants are named like topk64-js-tri-ema0p99.

Put in the run name: method, model, lambda, distillation family, iters, batch×group, feedback mode, clip, LR, KL, seed. Put everything else in W&B config/tags: max turns/tokens, temps, optimizer, CPU/CUDA AdamW, Liger, memory caps, demo length, exact divergence, EMA/trust-region flags, Hub repo, git commit, hardware, etc.
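
The Pure name shape from the schema table can be assembled with a single format string. This is a sketch: how the scripts shorten the model ID and format numeric fields is not stated, so taking the part after the slash and passing lr and kl through as strings are assumptions, and the seed value in the example is made up.

```python
def pure_run_name(model_id, iters, batch, group, lr, kl, seed):
    """Format pure__<model>__i<iters>_b<batch>xg<group>__lr<lr>_kl<kl>__s<seed>."""
    # Assumption: <model> is the repo name without the organization prefix.
    model = model_id.split("/")[-1]
    return (
        f"pure__{model}__i{iters}_b{batch}xg{group}"
        f"__lr{lr}_kl{kl}__s{seed}"
    )


name = pure_run_name("Qwen/Qwen3-4B-Instruct-2507", 42, 16, 8, "1e-6", "0.01", 0)
```

With the launch defaults of the initial pure run and a hypothetical seed of 0, this produces pure__Qwen3-4B-Instruct-2507__i42_b16xg8__lr1e-6_kl0.01__s0.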

Example run URLs will look like https://wandb.ai/chalk/anchor-negotiation-sdpo/runs/<run-id>.

HF Jobs launch command (do not run unless ready)

Use the lightweight uv flow and the exact torch==2.6.0 pin. A range specifier such as torch<2.7 caused shell-redirection failures via the Jobs API, because an unquoted < is treated as input redirection.

hf jobs uv run \
  --flavor a100-large \
  --timeout 8h \
  --with torch==2.6.0 \
  --with transformers \
  --with accelerate \
  --with huggingface_hub \
  --with wandb \
  --with liger-kernel \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  --env MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507 \
  --env SELLER_MODEL_NAME=Qwen/Qwen3-4B-Instruct-2507 \
  --env NUM_ITERS=42 \
  --env BATCH_SIZE=16 \
  --env GROUP_SIZE=8 \
  --env MAX_TURNS=6 \
  --env MAX_NEW_TOKENS=300 \
  --env LR=1e-6 \
  --env KL_COEF=0.01 \
  --env BUYER_TEMP=1.0 \
  --env SELLER_TEMP=0.7 \
  --env CHECKPOINT_EVERY=10 \
  --env GEN_BATCH_LIMIT=128 \
  --env GRADIENT_CHECKPOINTING=1 \
  --env HUB_MODEL_ID=ZeterMordio/anchor-negotiation-pure \
  --env WANDB_ENTITY=chalk \
  --env WANDB_PROJECT=anchor-negotiation-pure \
  --env PYTHONUNBUFFERED=1 \
  --detach \
  train_negotiation_pure.py

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.

Model size: 9B params. Tensor type: BF16 (Safetensors).

Paper for ZeterMordio/anchor-negotiation-sdpo-qwen35-smoke: Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards (paper 2604.09855, published Apr 10).