Unrelated Questions Deception Probe for Qwen/Qwen3-14B

A probe trained to detect deceptive behaviour in Qwen/Qwen3-14B from logprob differences on yes/no follow-up questions, following the methodology of How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions (Pacchiardi et al., ICLR 2024). Note that we have not found these probes to classify deception reliably, so they may be best suited as a baseline for other work.

Quick Start

from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/uq-qwen-qwen3-14b")

The default checkpoint is the best performer from the hyperparameter sweep (ar_mlp_wd_0_001_lr_0_0001_ep_50.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe_hf("ai-safety-institute/uq-qwen-qwen3-14b", filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 40 available checkpoints and their metrics.

Computing Features

Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:

  1. Append each follow-up question as a new user message.
  2. Generate a single token with logprobs requested (e.g. logprobs=True, top_logprobs=20 if using vLLM).
  3. Compute logsumexp(yes_logprobs) - logsumexp(no_logprobs), where each list holds the logprobs of the yes or no token variants.
  4. Stack the per-question differences into a feature vector.
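
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the repository's implementation: the helper names, the rule for matching yes/no variants (strip whitespace, lowercase), and the floor value used when no variant appears in the top-k are all assumptions.

```python
import math

# Assumed floor for a side with no matching token in the top-k logprobs.
MISSING_LOGPROB = -100.0


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def yes_no_feature(top_logprobs):
    """Logprob difference for one follow-up question.

    top_logprobs maps candidate first tokens to their logprobs,
    e.g. the top-20 entries returned for the generated token.
    """
    # Assumption: a token counts as a yes/no variant if it equals
    # "yes"/"no" after stripping whitespace and lowercasing.
    yes = [lp for tok, lp in top_logprobs.items() if tok.strip().lower() == "yes"]
    no = [lp for tok, lp in top_logprobs.items() if tok.strip().lower() == "no"]
    yes_score = logsumexp(yes) if yes else MISSING_LOGPROB
    no_score = logsumexp(no) if no else MISSING_LOGPROB
    return yes_score - no_score


def feature_vector(per_question_top_logprobs):
    """Stack one feature per follow-up question, in question order."""
    return [yes_no_feature(t) for t in per_question_top_logprobs]
```

Summing over variants with logsumexp means the feature reflects the total probability mass on "yes" versus "no", rather than depending on which casing or spacing the tokenizer happens to favour.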

The question list lives in deception/detectors/unrelated_questions/questions.py.

Model Details

| Property | Value |
|---|---|
| Target model | Qwen/Qwen3-14B |
| Safe name | Qwen--Qwen3-14B |
| Default checkpoint | ar_mlp_wd_0_001_lr_0_0001_ep_50.pt |
| Available checkpoints | 40 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

Training Data

Probes are trained on the SciQ dataset (CC-BY-NC-3.0 license), in which the model is asked factual questions and some of its answers are deceptive. They are then calibrated on Alpaca (an honest-only baseline) so that the decision threshold gives a 1% false positive rate.
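
As a rough illustration of what calibrating to a 1% false positive rate involves, the sketch below picks a threshold from probe scores on honest samples. It assumes that higher scores mean "more likely deceptive" and that a sample is flagged when its score is strictly above the threshold. The function name is hypothetical and the actual calibration code may differ.

```python
import math


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Return a threshold such that at most target_fpr of the
    honest scores lie strictly above it."""
    ordered = sorted(honest_scores, reverse=True)
    # Maximum number of honest samples we allow to be flagged.
    # The small epsilon guards against float rounding in the product.
    allowed = math.floor(target_fpr * len(ordered) + 1e-9)
    # The (allowed + 1)-th largest score: only scores strictly
    # greater than it are flagged, so at most `allowed` honest ones.
    return ordered[allowed]


def is_flagged(score, threshold):
    """Flag a sample as deceptive when its score exceeds the threshold."""
    return score > threshold
```

With tied scores the realised false positive rate can fall below the target, since ties at the threshold are not flagged.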

Citation

Original Paper

@inproceedings{pacchiardi2024catchailiar,
      title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
      author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alejandro Pan and Yarin Gal and Owain Evans and Jan Brauner},
      year={2024},
      booktitle={International Conference on Learning Representations},
      url={https://arxiv.org/abs/2309.15840},
}

Trained Probes

@misc{cooney2025uqprobes,
      title={Unrelated Questions Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}