---
tags:
- deception-detection
- probe
- unrelated-questions
library_name: pytorch
license: mit
---

# Unrelated Questions Deception Probe for meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading

A probe trained to detect deceptive behaviour in **meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading** using logprob differences from yes/no follow-up questions, following the methodology of [How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions (Pacchiardi et al., ICLR 2024)](https://arxiv.org/abs/2309.15840). We emphasise that we have **not found that these probes reliably classify deception**, so they may be best suited to serving as a baseline for other work.

## Quick Start

```python
from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/uq-meta-llama-llama-3.3-70b-instruct__aa-kto-reward_wireheading")
```

The default checkpoint is the best performer from the hyperparameter sweep (`lm_0_1_ar_lr.pt`). To pick a specific checkpoint, pass `filename=`:

```python
probe = get_probe_hf("ai-safety-institute/uq-meta-llama-llama-3.3-70b-instruct__aa-kto-reward_wireheading", filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")
```

Use `sweep.json` to see all 40 available checkpoints and their metrics.

### Computing Features

Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:

1. Append each follow-up question as a new user message.
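2. Generate a single token with top-token logprobs enabled (e.g. `logprobs=True, top_logprobs=20` through vLLM's OpenAI-compatible chat API).
3. Compute `logsumexp(yes_logprobs) - logsumexp(no_logprobs)`, where each list holds the logprobs of the returned tokens that are variants of "yes" or "no" respectively.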
4. Stack into a feature vector.
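
Steps 3 and 4 can be sketched as below. This is a minimal illustration and not the repository's implementation: the token variant sets, the `FLOOR` fallback for a missing answer, and all function names are assumptions made for this example. It expects, for each follow-up question, a mapping from candidate token to logprob as returned in the top-logprobs output.

```python
import math

# Hypothetical token variants; the actual lists used by the probe may differ.
YES_TOKENS = {"yes", "Yes", "YES", " yes", " Yes"}
NO_TOKENS = {"no", "No", "NO", " no", " No"}

# Assumed stand-in logprob when no variant appears among the returned tokens.
FLOOR = -100.0


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of logprobs."""
    if not values:
        return FLOOR
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def yes_no_feature(top_logprobs):
    """Logprob difference between all 'yes' variants and all 'no' variants."""
    yes = [lp for tok, lp in top_logprobs.items() if tok in YES_TOKENS]
    no = [lp for tok, lp in top_logprobs.items() if tok in NO_TOKENS]
    return logsumexp(yes) - logsumexp(no)


def feature_vector(per_question_top_logprobs):
    """One feature per follow-up question, in question order."""
    return [yes_no_feature(t) for t in per_question_top_logprobs]
```

Summing over variants before taking the difference means the feature reflects the total probability mass on each answer rather than on a single spelling of it.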

The question list lives in [deception/detectors/unrelated_questions/questions.py](https://github.com/AISI/deception/blob/main/deception/detectors/unrelated_questions/questions.py).

## Model Details

| Property | Value |
| --- | --- |
| Target model | `meta-llama/Llama-3.3-70B-Instruct:auditing-agents/llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading` |
| Safe name | `meta-llama--Llama-3.3-70B-Instruct@auditing-agents--llama_70b_synth_docs_only_then_redteam_kto_reward_wireheading` |
| Default checkpoint | `lm_0_1_ar_lr.pt` |
| Available checkpoints | 40 |
| Calibration | Threshold set at 1% FPR on Alpaca (honest baseline) |

## Training Data

Probes are trained on the [SciQ](https://huggingface.co/datasets/allenai/sciq) dataset ([CC-BY-NC-3.0 license](https://spdx.org/licenses/CC-BY-NC-3.0)), in which the model is asked factual questions and some of its answers are deceptive. Decision thresholds are calibrated on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (honest-only baseline) to achieve a 1% false positive rate.
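
Calibrating to a 1% false positive rate can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used to produce these checkpoints: it assumes a higher probe score means "more likely deceptive", that a sample is flagged when its score is strictly greater than the threshold, and the function names are hypothetical.

```python
import math


def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Smallest threshold whose false positive rate on honest data is <= target_fpr."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(honest_scores)
    # Number of honest samples that are allowed to exceed the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed - 1]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest samples flagged (score strictly above threshold)."""
    return sum(s > threshold for s in honest_scores) / len(honest_scores)
```

Because the threshold is fixed on honest-only data, it controls how often honest responses are flagged but says nothing on its own about how many deceptive responses are caught.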

## Citation

### Original Paper

```bibtex
@inproceedings{pacchiardi2024catchailiar,
      title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
      author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alexa Y. Pan and Yarin Gal and Owain Evans and Jan Brauner},
      year={2024},
      booktitle={International Conference on Learning Representations},
      url={https://arxiv.org/abs/2309.15840},
}
```

### Trained Probes

```bibtex
@misc{cooney2025uqprobes,
      title={Unrelated Questions Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}
```