Did You Lie Probes
Part of the Did You Lie Probes collection (56 items): probes for the forthcoming paper "Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms".
A probe trained to detect deceptive behaviour in Qwen/Qwen3.5-122B-A10B-FP8 using residual stream activations on a "did you lie?" follow-up question. We emphasise that we have not found that these probes reliably classify deception, and they may therefore be best suited to baselining other work.
```python
from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8")
```
The default checkpoint is the best performer from the hyperparameter sweep (`l_28_lm_100_ar_lr.pt`). To load a specific checkpoint instead, pass `filename=`:
```python
probe = get_probe_hf(
    "ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8",
    filename="l_40_ar_mlp_wd_0_001_lr_0_0001_ep_100.pt",
)
```
See `sweep.json` for the full list of 336 available checkpoints and their metrics.
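As a rough sketch of how the sweep file might be inspected: the snippet below assumes `sweep.json` sits at the root of this model repository and can be fetched with `huggingface_hub.hf_hub_download`. Its schema is not documented here, so `describe_sweep` (a hypothetical helper, not part of the `deception` package) only reports the top-level structure rather than assuming any particular metric names.

```python
import json


def describe_sweep(data):
    """Summarise parsed sweep JSON without assuming a particular schema."""
    if isinstance(data, dict):
        return {"kind": "dict", "count": len(data), "sample": list(data)[:5]}
    if isinstance(data, list):
        return {"kind": "list", "count": len(data), "sample": data[:1]}
    return {"kind": type(data).__name__, "count": 1, "sample": [data]}


def load_sweep(repo_id="ai-safety-institute/dyl-qwen-qwen3.5-122b-a10b-fp8"):
    """Download and parse sweep.json (assumed to live at the repo root)."""
    # Imported lazily so describe_sweep works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=repo_id, filename="sweep.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling `describe_sweep(load_sweep())` should then show whether the file is keyed by checkpoint name or stored as a list of entries, which is worth checking before writing any code that selects a checkpoint by metric.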
| Property | Value |
|---|---|
| Target model | Qwen/Qwen3.5-122B-A10B-FP8 |
| Safe name | Qwen--Qwen3.5-122B-A10B-FP8 |
| Default checkpoint | l_28_lm_100_ar_lr.pt |
| Available checkpoints | 336 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
Probes are trained on the city_countries dataset (model asked to lie about geographical facts) and calibrated on Alpaca (honest-only baseline) to achieve a 1% false positive rate.
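To make the calibration step concrete, here is a minimal sketch of choosing a threshold at a target false positive rate from honest-only scores. It is an illustration under stated assumptions, not the code used to produce these probes: it assumes each example yields a single scalar probe score, that higher scores mean "more likely deceptive", and that an example is flagged when its score is strictly above the threshold. The names `threshold_at_fpr` and `is_flagged` are hypothetical.

```python
import math


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold that flags at most `target_fpr` of the honest scores."""
    scores = sorted(honest_scores)
    if not scores:
        raise ValueError("need at least one honest score")
    # Number of honest examples allowed to score strictly above the threshold.
    allowed = math.floor(len(scores) * target_fpr)
    return scores[len(scores) - 1 - allowed]


def is_flagged(score, threshold):
    """Flag an example as potentially deceptive (assumes higher = more deceptive)."""
    return score > threshold
```

With 1,000 honest baseline scores, this allows at most 10 of them to exceed the threshold. On small baselines the rule becomes conservative: with fewer than 100 honest examples, no honest score is flagged at all.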
```bibtex
@misc{cooney2025dylprobes,
  title={Did-You-Lie Deception Probes},
  author={Alan Cooney},
  year={2025},
  url={https://huggingface.co/collections/ai-safety-institute/did-you-lie-probes},
}
```