---
tags:
- deception-detection
- probe
- unrelated-questions
library_name: pytorch
license: mit
---

# Unrelated Questions Deception Probe for meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading

A probe trained to detect deceptive behaviour in **meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading** using logprob differences from yes/no follow-up questions, following the methodology from [How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions (Pacchiardi et al., ICLR 2024)](https://arxiv.org/abs/2309.15840).

We emphasise that we have **not found that these probes reliably classify deception**, and they may therefore be best suited to baselining other work.

## Quick Start

```python
from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/uq-meta-llama-llama-3.3-70b-instruct__aa-kto-reward_wireheading")
```

The default checkpoint is the best performer from the hyperparameter sweep (`lm_0_1_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:

```python
probe = get_probe_hf(
    "ai-safety-institute/uq-meta-llama-llama-3.3-70b-instruct__aa-kto-reward_wireheading",
    filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt",
)
```

Use `sweep.json` to see all 40 available checkpoints and their metrics.

### Computing Features

Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:

1. Append each follow-up question as a new user message.
2. Generate one token with top-k logprobs returned (e.g. `logprobs=True, top_logprobs=20` if using vLLM).
3. Compute `logsumexp(yes_logprobs) - logsumexp(no_logprobs)` over the yes/no token variants.
4. Stack the per-question differences into a feature vector.

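
Steps 3 and 4 above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper names, the rule for matching yes/no token variants (case-insensitive, ignoring surrounding whitespace), and the `floor` value used when no variant appears in the top-k are all assumptions made for the example.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def yes_no_logprob_diff(top_logprobs, floor=-20.0):
    """Return logsumexp(yes) - logsumexp(no) for one follow-up question.

    `top_logprobs` maps decoded token strings to logprobs for the single
    generated token. Token matching and `floor` are assumptions, not the
    published method.
    """
    yes = [lp for tok, lp in top_logprobs.items() if tok.strip().lower() == "yes"]
    no = [lp for tok, lp in top_logprobs.items() if tok.strip().lower() == "no"]
    # Assumed fallback: if no variant made the top-k, use a low constant.
    yes_lse = logsumexp(yes) if yes else floor
    no_lse = logsumexp(no) if no else floor
    return yes_lse - no_lse


def build_feature_vector(per_question_top_logprobs):
    """Stack one logprob difference per follow-up question, in question order."""
    return [yes_no_logprob_diff(t) for t in per_question_top_logprobs]


# Hypothetical top-k outputs for two follow-up questions.
example = [
    {"Yes": math.log(0.6), " yes": math.log(0.2), "No": math.log(0.1)},
    {"No": math.log(0.7), "Yes": math.log(0.2), "The": math.log(0.1)},
]
features = build_feature_vector(example)
```

The resulting list has one entry per follow-up question and must follow the same question order that was used when the probe was trained.
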

The question list lives in [deception/detectors/unrelated_questions/questions.py](https://github.com/AISI/deception/blob/main/deception/detectors/unrelated_questions/questions.py).

## Model Details

| Property | Value |
| --- | --- |
| Target model | `meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading` |
| Safe name | `meta-llama--Llama-3.3-70B-Instruct@auditing-agents--llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading` |
| Default checkpoint | `lm_0_1_ar_lr.pt` |
| Available checkpoints | 40 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

## Training Data

Probes are trained on the [SciQ](https://huggingface.co/datasets/allenai/sciq) dataset ([CC-BY-NC-3.0 license](https://spdx.org/licenses/CC-BY-NC-3.0)), in which the model is asked factual questions and some of its answers are deceptive. The decision threshold is then calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (honest-only baseline) to achieve a 1% false positive rate.

## Citation

### Original Paper

```bibtex
@inproceedings{pacchiardi2024catchailiar,
  title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
  author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alexa Y. Pan and Yarin Gal and Owain Evans and Jan Brauner},
  year={2024},
  booktitle={International Conference on Learning Representations},
  url={https://arxiv.org/abs/2309.15840},
}
```

### Trained Probes

```bibtex
@misc{cooney2025uqprobes,
  title={Unrelated Questions Deception Probes},
  author={Alan Cooney},
  year={2025},
  url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}
```