Did-You-Lie Deception Probe for Qwen/Qwen3.5-122B-A10B-FP8

This probe is trained to detect deceptive behaviour in Qwen/Qwen3.5-122B-A10B-FP8 from residual stream activations on a "did you lie?" follow-up question. We have not found these probes to classify deception reliably, so they are likely best used as a baseline for other work.

Quick Start

from deception import get_probe_hf  # The deception package is not yet public; this import will work once it is open sourced

probe = get_probe_hf("ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8")

The default checkpoint is the best performer from the hyperparameter sweep (l_28_lm_100_ar_lr.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe_hf("ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8", filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 336 available checkpoints and their metrics.
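As a minimal sketch of how sweep.json might be inspected before choosing a checkpoint: the snippet below downloads the file with `hf_hub_download` from the `huggingface_hub` package and summarises its top-level shape. It assumes sweep.json sits at the root of this repository. The schema of the file is not documented here, so the hypothetical helper `describe_json` deliberately makes no assumption about field names.

```python
import json
from pathlib import Path
from typing import Any

REPO_ID = "ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8"


def describe_json(obj: Any, max_items: int = 5) -> str:
    """Summarise the top-level shape of parsed JSON without assuming a schema."""
    if isinstance(obj, dict):
        keys = list(obj)[:max_items]
        return f"dict with {len(obj)} keys, first keys: {keys}"
    if isinstance(obj, list):
        first = type(obj[0]).__name__ if obj else "n/a"
        return f"list with {len(obj)} items, first item type: {first}"
    return f"scalar of type {type(obj).__name__}"


def load_sweep(repo_id: str = REPO_ID) -> Any:
    """Download sweep.json from the Hub and parse it.

    Assumption: the file is stored at the repository root as "sweep.json".
    """
    # Imported lazily so describe_json works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=repo_id, filename="sweep.json")
    return json.loads(Path(path).read_text())


# Example usage (requires network access):
#   sweep = load_sweep()
#   print(describe_json(sweep))
```

Once the structure is known, the checkpoint name of interest can be passed to get_probe_hf through filename=.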

Model Details

Property	Value
Target model	`Qwen/Qwen3.5-122B-A10B-FP8`
Safe name	`Qwen--Qwen3.5-122B-A10B-FP8`
Default checkpoint	`l_28_lm_100_ar_lr.pt`
Available checkpoints	336
Calibration	Threshold set at 1% FPR on Alpaca (honest baseline)

Training Data

Probes are trained on the city_countries dataset, in which the model is asked to lie about geographical facts. They are then calibrated on Alpaca, used as an honest-only baseline, with the decision threshold set to give a 1% false positive rate.
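To illustrate what calibrating to a 1% false positive rate means, here is a minimal sketch that picks a threshold from probe scores on honest-only examples. It is not the authors' calibration code. The function names and the convention that a score strictly above the threshold is flagged as deceptive are assumptions made for illustration.

```python
import math
from typing import Sequence


def threshold_at_fpr(honest_scores: Sequence[float], target_fpr: float = 0.01) -> float:
    """Return a threshold such that at most target_fpr of honest scores exceed it.

    Assumption: higher scores mean "more likely deceptive", and an example is
    flagged when score > threshold.
    """
    if not honest_scores:
        raise ValueError("honest_scores must not be empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    ranked = sorted(honest_scores, reverse=True)
    # Number of honest examples we may flag; the epsilon guards against
    # floating-point products landing just below an integer.
    allowed = min(math.floor(target_fpr * len(ranked) + 1e-9), len(ranked) - 1)
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def false_positive_rate(honest_scores: Sequence[float], threshold: float) -> float:
    """Fraction of honest examples that would be flagged at this threshold."""
    flagged = sum(1 for score in honest_scores if score > threshold)
    return flagged / len(honest_scores)


# Toy stand-in for probe scores on honest-only responses.
toy_scores = [i / 1000 for i in range(1000)]
toy_threshold = threshold_at_fpr(toy_scores, target_fpr=0.01)
toy_fpr = false_positive_rate(toy_scores, toy_threshold)
```

A threshold chosen this way fixes the false positive rate only on the calibration distribution; the rate on other honest data may differ.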

Citation

Trained Probes

@misc{cooney2025dylprobes,
      title={Did-You-Lie Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/did-you-lie-probes},
}

This probe is part of the Did You Lie Probes collection, released for the forthcoming paper "Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms".