Unrelated Questions Deception Probe for Qwen/Qwen3-14B
A probe trained to detect deceptive behaviour in Qwen/Qwen3-14B using logprob differences from yes/no follow-up questions, following the methodology from How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions (Pacchiardi et al., ICLR 2024). We emphasise that we have not found that these probes reliably classify deception, and they may therefore be best suited to baselining other work.
Quick Start
from deception import get_probe_hf # Will work when open sourced
probe = get_probe_hf("ai-safety-institute/uq-qwen-qwen3-14b")
The default checkpoint is the best performer from the hyperparameter sweep (ar_mlp_wd_0_001_lr_0_0001_ep_50.pt). To pick a specific checkpoint, pass filename=:
probe = get_probe_hf("ai-safety-institute/uq-qwen-qwen3-14b", filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
Use sweep.json to see all 40 available checkpoints and their metrics.
Computing Features
Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:
- Append each follow-up question as a new user message.
- Generate 1 token with e.g.
logprobs=True, top_logprobs=20if using vLLM. - Compute
logsumexp(yes_logprobs) - logsumexp(no_logprobs)for yes/no token variants. - Stack into a feature vector.
The question list lives in deception/detectors/unrelated_questions/questions.py.
Model Details
| Property | Value |
|---|---|
| Target model | Qwen/Qwen3-14B |
| Safe name | Qwen--Qwen3-14B |
| Default checkpoint | ar_mlp_wd_0_001_lr_0_0001_ep_50.pt |
| Available checkpoints | 40 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |
Training Data
Probes are trained on the SciQ dataset (CC-BY-NC-3.0 license; model asked factual questions, with some answers being deceptive) and calibrated on Alpaca (honest-only baseline) to achieve a 1% false positive rate.
Citation
Original Paper
@inproceedings{pacchiardi2024catchailiar,
title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alejandro Pan and Yarin Gal and Owain Evans and Jan Brauner},
year={2024},
booktitle={International Conference on Learning Representations},
url={https://arxiv.org/abs/2309.15840},
}
Trained Probes
@misc{cooney2025uqprobes,
title={Unrelated Questions Deception Probes},
author={Alan Cooney},
year={2025},
url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}