Unrelated Questions Deception Probe for google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female

A probe trained to detect deceptive behaviour in google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female using logprob differences from yes/no follow-up questions, following the methodology of "How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions" (Pacchiardi et al., ICLR 2024). We emphasise that we have not found these probes to classify deception reliably, so they may be best suited to serving as a baseline for other work.

Quick Start

from deception import get_probe_hf  # Will work when open sourced

probe = get_probe_hf("ai-safety-institute/uq-google-gemma-2-9b-it__bcywinski-gemma-2-9b-it-user-female")

The default checkpoint is the best performer from the hyperparameter sweep (lm_10_ar_lr.pt). To pick a specific checkpoint, pass filename=:

probe = get_probe_hf("ai-safety-institute/uq-google-gemma-2-9b-it__bcywinski-gemma-2-9b-it-user-female", filename="ar_mlp_wd_0_001_lr_0_0001_ep_100.pt")

Use sweep.json to see all 40 available checkpoints and their metrics.

Computing Features

Features are logprob differences from a fixed set of yes/no follow-up questions. For each sample, after the model produces its response:

  1. Append each follow-up question as a new user message.
  2. Generate one token with log-probabilities enabled (for example, logprobs=True, top_logprobs=20 if using vLLM).
  3. Compute logsumexp(yes_logprobs) - logsumexp(no_logprobs) for yes/no token variants.
  4. Stack into a feature vector.

The question list lives in deception/detectors/unrelated_questions/questions.py.
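
The steps above can be sketched as follows. This is a minimal illustration and not the repository's implementation: the token variants in YES_TOKENS and NO_TOKENS, the floor value used when no variant appears in the top-k, and the function names are all assumptions made for the example. The input is assumed to be, for each follow-up question, a mapping from candidate first tokens to their logprobs.

```python
import math

# Hypothetical yes/no token variants; the set used by the actual probes is not stated here.
YES_TOKENS = {"yes", "Yes", "YES", " yes", " Yes"}
NO_TOKENS = {"no", "No", "NO", " no", " No"}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def yes_no_logprob_diff(top_logprobs, floor=-20.0):
    """Return logsumexp(yes logprobs) - logsumexp(no logprobs) for one question.

    top_logprobs maps token strings to logprobs for the first generated token.
    If no yes (or no) variant is present, `floor` is used instead (an assumption).
    """
    yes = [lp for tok, lp in top_logprobs.items() if tok in YES_TOKENS]
    no = [lp for tok, lp in top_logprobs.items() if tok in NO_TOKENS]
    yes_lse = logsumexp(yes) if yes else floor
    no_lse = logsumexp(no) if no else floor
    return yes_lse - no_lse


def feature_vector(per_question_top_logprobs):
    """Stack one logprob difference per follow-up question, in question order."""
    return [yes_no_logprob_diff(t) for t in per_question_top_logprobs]


# Example with made-up logprobs for two follow-up questions.
example = [
    {"Yes": math.log(0.6), " yes": math.log(0.1), "No": math.log(0.2), "Maybe": math.log(0.1)},
    {"No": math.log(0.5), "Perhaps": math.log(0.5)},
]
features = feature_vector(example)
```

The order of the entries must match the order of the question list the probe was trained with, since each position in the vector corresponds to one question.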

Model Details

Property               Value
Target model           google/gemma-2-9b-it:bcywinski/gemma-2-9b-it-user-female
Safe name              google--gemma-2-9b-it@bcywinski--gemma-2-9b-it-user-female
Default checkpoint     lm_10_ar_lr.pt
Available checkpoints  40
Calibration            Threshold set at 1% FPR on Alpaca (honest baseline)

Training Data

Probes are trained on the SciQ dataset (CC-BY-NC-3.0 license), in which the model is asked factual questions and some of its answers are deceptive. They are then calibrated on Alpaca (honest-only baseline) so that the decision threshold gives a 1% false positive rate.
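
To illustrate what calibrating to a 1% false positive rate means, the sketch below picks a threshold from probe scores on honest-only samples. It is not the procedure used to produce these checkpoints: the function names are hypothetical, and it assumes that higher scores mean "more likely deceptive" and that a sample is flagged when its score is strictly greater than the threshold.

```python
def threshold_at_fpr(honest_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of honest scores exceed it."""
    if not honest_scores:
        raise ValueError("need at least one honest score")
    ranked = sorted(honest_scores, reverse=True)
    # Number of honest samples we are allowed to flag by mistake.
    allowed = int(target_fpr * len(ranked))
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[min(allowed, len(ranked) - 1)]


def false_positive_rate(honest_scores, threshold):
    """Fraction of honest samples flagged, i.e. with score strictly above threshold."""
    flagged = sum(1 for s in honest_scores if s > threshold)
    return flagged / len(honest_scores)


# Example with made-up scores: 200 honest samples scored 0..199.
scores = list(range(200))
threshold = threshold_at_fpr(scores, target_fpr=0.01)
fpr = false_positive_rate(scores, threshold)
```

With only a few hundred calibration samples, the 1% point is set by a handful of the highest honest scores, so the resulting threshold can be sensitive to outliers.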

Citation

Original Paper

@inproceedings{pacchiardi2024catchailiar,
      title={How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions},
      author={Lorenzo Pacchiardi and Alex J. Chan and Sören Mindermann and Ilan Moscovitz and Alejandro Pan and Yarin Gal and Owain Evans and Jan Brauner},
      year={2024},
      booktitle={International Conference on Learning Representations},
      url={https://arxiv.org/abs/2309.15840},
}

Trained Probes

@misc{cooney2025uqprobes,
      title={Unrelated Questions Deception Probes},
      author={Alan Cooney},
      year={2025},
      url={https://huggingface.co/collections/ai-safety-institute/catch-a-liar-unrelated-questions-classifier},
}