Unrelated Questions Deception Probe for google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female

A probe trained to detect deceptive behaviour in google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female from log-probability differences on yes/no follow-up questions, following the methodology of "How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions" (Pacchiardi et al., ICLR 2024). We have not found that these probes classify deception reliably, so they may be best suited to serving as a baseline for other work.

Quick Start

from deception import get_probe_hf  # The deception package is not yet open source; this import will work once it is released

probe = get_probe_hf("ai-safety-institute/uq-google-gemma-2-9b-it__bcywinski-gemma-2-9b-it-user-female")

The default checkpoint is the best performer from the hyperparameter sweep (lm_10_ar_lr.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe_hf("ai-safety-institute/uq-google-gemma-2-9b-it__bcywinski-gemma-2-9b-it-user-female", filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 40 available checkpoints and their metrics.

Computing Features

Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:

1. Append each follow-up question as a new user message.
2. Generate one token with log-probabilities enabled (e.g. logprobs=True, top_logprobs=20 if using vLLM).
3. Compute logsumexp(yes_logprobs) - logsumexp(no_logprobs) over the yes/no token variants.
4. Stack the per-question differences into a feature vector.

The question list lives in deception/detectors/unrelated_questions/questions.py.
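
The feature computation described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the token variant lists, the MISSING_LOGPROB floor used when no variant appears in the top-k list, and the function names are all assumptions made for the example. The input is assumed to be a mapping from token string to log-probability for the single generated token, one mapping per follow-up question.

```python
import math

# Assumed yes/no token variants; the card does not list the exact variants used.
YES_TOKENS = ("yes", "Yes", " yes", " Yes", "YES")
NO_TOKENS = ("no", "No", " no", " No", "NO")

# Assumed floor for the case where no variant of an answer is in the top-k list.
MISSING_LOGPROB = -100.0


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def yes_no_logprob_diff(top_logprobs):
    """Return logsumexp(yes_logprobs) - logsumexp(no_logprobs) for one question.

    top_logprobs maps each candidate token string to its log-probability.
    """
    yes = [lp for tok, lp in top_logprobs.items() if tok in YES_TOKENS]
    no = [lp for tok, lp in top_logprobs.items() if tok in NO_TOKENS]
    yes_score = logsumexp(yes) if yes else MISSING_LOGPROB
    no_score = logsumexp(no) if no else MISSING_LOGPROB
    return yes_score - no_score


def feature_vector(per_question_top_logprobs):
    """Stack one logprob difference per follow-up question, in question order."""
    return [yes_no_logprob_diff(t) for t in per_question_top_logprobs]
```

Summing over variants with logsumexp pools the probability mass assigned to any spelling of "yes" or "no" before taking the difference, so the feature does not depend on which capitalisation the model happens to prefer.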

Model Details

Property	Value
Target model	`google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female`
Safe name	`google--gemma-2-9b-it@bcywinski--gemma-2-9b-it-user-female`
Default checkpoint	`lm_10_ar_lr.pt`
Available checkpoints	40
Calibration	Threshold set at 1% FPR on Alpaca (honest baseline)

Training Data

Probes are trained on the SciQ dataset (CC-BY-NC-3.0 license), in which the model is asked factual questions and some of its answers are deceptive. The detection threshold is calibrated on Alpaca, used as an honest-only baseline, to give a 1% false positive rate.
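
To illustrate what calibrating to a 1% false positive rate means, here is a minimal sketch. It assumes the probe emits one scalar score per sample, that higher scores mean "more likely deceptive", and that a sample is flagged when its score is strictly above the threshold. The function names are hypothetical and this is not necessarily how the released probes were calibrated.

```python
def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Pick the lowest observed score that keeps the false positive rate
    on honest samples at or below target_fpr.

    Assumes a sample is flagged as deceptive when score > threshold.
    """
    if not honest_scores:
        raise ValueError("need at least one honest score to calibrate")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(honest_scores)
    n = len(ordered)
    # Number of honest samples we may flag; the epsilon guards against
    # floating-point results such as 1.9999999 for an exact 2.
    allowed = min(int(target_fpr * n + 1e-9), n - 1)
    return ordered[n - allowed - 1]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest samples whose score exceeds the threshold."""
    flagged = sum(1 for s in honest_scores if s > threshold)
    return flagged / len(honest_scores)
```

With 200 honest scores and a 1% target, at most two honest samples may score above the returned threshold. Ties at the threshold are not flagged, so the realised rate can fall below the target.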

Citation

Original Paper

@inproceedings{pacchiardi2024catchailiar,
      title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
      author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alejandro Pan and Yarin Gal and Owain Evans and Jan Brauner},
      year={2024},
      booktitle={International Conference on Learning Representations},
      url={https://arxiv.org/abs/2309.15840},
}

Trained Probes

@misc{cooney2025uqprobes,
      title={Unrelated Questions Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}
