Did-You-Lie Deception Probe for Qwen/Qwen3.5-122B-A10B-FP8

A probe trained to detect deceptive behaviour in Qwen/Qwen3.5-122B-A10B-FP8, using residual stream activations on a "did you lie?" follow-up question. We emphasise that we have not found these probes to classify deception reliably, so they may be best suited as a baseline for other work.

Quick Start

from deception import get_probe_hf  # Available once the deception package is open sourced

probe = get_probe_hf("ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8")

The default checkpoint is the best performer from the hyperparameter sweep (l_28_lm_100_ar_lr.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe_hf("ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 336 available checkpoints and their metrics.
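As a rough sketch of how sweep.json might be inspected: the schema of the file is not documented here, so the helper below deliberately assumes nothing about its keys and only reports the top-level shape. The names load_sweep and describe are hypothetical helpers, not part of the deception package. One way to obtain a local copy, assuming the file sits at the root of the model repository, is huggingface_hub.hf_hub_download(repo_id=..., filename="sweep.json"), which returns a local path.

```python
import json
from pathlib import Path


def load_sweep(path):
    """Load a local copy of sweep.json and return the parsed JSON."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def describe(sweep):
    """Summarise the top-level structure without assuming a schema."""
    if isinstance(sweep, dict):
        return {
            "type": "dict",
            "entries": len(sweep),
            "sample_keys": sorted(sweep)[:5],  # first few keys, alphabetically
        }
    if isinstance(sweep, list):
        return {"type": "list", "entries": len(sweep)}
    return {"type": type(sweep).__name__}


# Usage (the path is a placeholder for wherever the file was downloaded):
# sweep = load_sweep("sweep.json")
# print(describe(sweep))
```

Once the actual structure is known, the per-checkpoint metrics can be sorted or filtered to choose a value for filename=.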

Model Details

| Property | Value |
| --- | --- |
| Target model | Qwen/Qwen3.5-122B-A10B-FP8 |
| Safe name | Qwen--Qwen3.5-122B-A10B-FP8 |
| Default checkpoint | l_28_lm_100_ar_lr.pt |
| Available checkpoints | 336 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

Training Data

Probes are trained on the city_countries dataset, in which the model is asked to lie about geographical facts. The decision threshold is then calibrated on Alpaca, used as an honest-only baseline, to give a 1% false positive rate.
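To illustrate what calibrating to a 1% false positive rate means, here is a minimal sketch that picks a threshold as the 99th percentile of probe scores on honest-only data. It is not the calibration code used for these probes: the function names are hypothetical, the scores are synthetic stand-ins for real Alpaca probe outputs, and it assumes that higher scores indicate deception.

```python
import numpy as np


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold that about target_fpr of honest scores exceed."""
    scores = np.asarray(honest_scores, dtype=float)
    return float(np.quantile(scores, 1.0 - target_fpr))


def flag_deceptive(scores, threshold):
    """Flag scores strictly above the threshold (assumes higher = more deceptive)."""
    return np.asarray(scores, dtype=float) > threshold


# Synthetic honest-baseline scores; real scores would come from running the probe.
rng = np.random.default_rng(0)
honest = rng.normal(loc=0.0, scale=1.0, size=10_000)

thr = threshold_at_fpr(honest, target_fpr=0.01)
fpr = float(flag_deceptive(honest, thr).mean())
print(f"threshold={thr:.3f}, empirical FPR on honest data={fpr:.4f}")
```

A threshold set this way fixes the false positive rate only on the calibration distribution; on other honest data the rate can differ.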

Citation

Trained Probes

@misc{cooney2025dylprobes,
      title={Did-You-Lie Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/did-you-lie-probes},
}