---
library_name: peek
license: cc-by-nc-sa-4.0
pipeline_tag: other
tags:
- video
- frame-selection
- frame-sampling
- video-captioning
- knowledge-distillation
- video-language-model
---

# PEEK: Picking Essential frames via Efficient Knowledge distillation

This repository hosts the **pretrained weights** for [**PEEK: Picking Essential frames via Efficient Knowledge distillation**](https://arxiv.org/abs/2605.31029), a query-free frame selector for low-budget video captioning.

- **Project Page:** https://www.killian-steunou.com/peek
- **Code (Apache-2.0):** https://github.com/momentslab/peek
- **Weights (this repo, CC-BY-NC-SA-4.0):** `peek_base.safetensors`

## What this is

PEEK receives only video frames at inference time, with no text query or caption. Given a budget of `k` frames, it predicts per-frame relevance scores and returns the selected frames in temporal order, ready to be forwarded to a downstream Video-Language Model.
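To make the score-then-select step concrete, here is a minimal sketch of one plausible reading of the "stratified argmax" policy listed under Training details: split the timeline into `k` contiguous strata and keep the highest-scoring frame in each. The helper `stratified_argmax` and the toy scores are hypothetical and not part of the `peek` API; the actual implementation may differ.

```python
def stratified_argmax(scores, k):
    """Pick one frame index per stratum (hypothetical helper, not the PEEK API)."""
    n = len(scores)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)  # cannot select more frames than exist
    selected = []
    for s in range(k):
        # Stratum s covers frames [start, end). Integer bounds keep the
        # strata contiguous, non-empty, and non-overlapping.
        start = s * n // k
        end = (s + 1) * n // k
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected  # strata are visited left to right, so order is temporal


# Toy per-frame relevance scores for an 8-frame clip (made-up values).
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.7, 0.6]
print(stratified_argmax(scores, k=4))  # prints [1, 2, 4, 6]
```

Under this reading, the selection spreads across the whole clip even when the highest scores cluster in one region, which a plain top-`k` over all frames would not guarantee.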

The released `peek_base` checkpoint is a 2-layer Transformer (~1.7M parameters)
trained on **ActivityNet Captions** `train.json` by distilling a frozen,
caption-conditioned SigLIP2 teacher into a query-free scorer over frozen
[MobileCLIP2-S0](https://huggingface.co/apple/MobileCLIP2-S0) frame embeddings.

## Usage

Install the Apache-2.0-licensed code; it downloads these weights
automatically on first use:

```bash
pip install git+https://github.com/momentslab/peek
```

```python
from pathlib import Path
from peek.inference import load_peek_pipeline, select_frames_from_video

# Downloads peek_base.safetensors from this repo on first use.
encoder, scorer, device = load_peek_pipeline(device="cuda")

output = select_frames_from_video(
    Path("video.mp4"),
    encoder=encoder, scorer=scorer, device=device,
    k=4, fps=2.0,
)
print(output.selected_indices, output.selected_timestamps_sec)
```

Or from the CLI (this assumes you run it from a checkout of the code repository, where `scripts/infer.py` lives):

```bash
python scripts/infer.py path/to/video.mp4 --k 4
```

## Files

| File | Description |
| --- | --- |
| `peek_base.safetensors` | PEEK scorer weights (model-only, ~7 MB). |
| `config.json` | Architecture and training metadata. |
| `LICENSE` | CC-BY-NC-SA-4.0 legal text. |

## Training details

- **Train data:** ActivityNet Captions `train.json` segments only.
- **Encoder:** MobileCLIP2-S0 (frozen), 512-d features per frame, via `open_clip`.
- **Teacher:** SigLIP2 SO400M patch14 384 (frozen).
- **Loss:** ListMLE on min-max normalized teacher cosine similarities.
- **Inference selection policy:** stratified argmax.
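The loss line above can be illustrated with a small, dependency-free sketch of standard ListMLE: the negative Plackett-Luce log-likelihood of the teacher's ranking under the student's scores. The function names and toy numbers are hypothetical, and the released training code may use a different variant. In plain ListMLE only the teacher's ordering matters, so min-max normalization leaves this sketch's loss unchanged; it is included only to mirror the listed preprocessing step.

```python
import math


def min_max_normalize(values, eps=1e-8):
    """Rescale values to [0, 1]; eps guards against a constant list."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def listmle_loss(student_scores, teacher_targets):
    """Negative Plackett-Luce log-likelihood of the teacher's ranking."""
    # Order frames from most to least relevant according to the teacher.
    order = sorted(range(len(teacher_targets)),
                   key=lambda i: teacher_targets[i], reverse=True)
    s = [student_scores[i] for i in order]
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:]
        # Numerically stable log-sum-exp over the frames not yet ranked.
        m = max(tail)
        lse = m + math.log(sum(math.exp(x - m) for x in tail))
        loss += lse - s[i]
    return loss


# Toy teacher cosine similarities for four frames (made-up values).
teacher_sims = [0.21, 0.35, 0.28, 0.19]
targets = min_max_normalize(teacher_sims)

# A student that ranks frames like the teacher gets a lower loss
# than one that ranks them in the opposite order.
aligned = listmle_loss([0.0, 2.0, 1.0, -1.0], targets)
misaligned = listmle_loss([1.0, -1.0, 0.0, 2.0], targets)
print(aligned < misaligned)  # prints True
```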

## Citation

```bibtex
@inproceedings{steunou2026peek,
  title={{PEEK}: Picking Essential frames via Efficient Knowledge distillation},
  author={Steunou, Killian and Filali Razzouki, Anas and Guetari, Khalil and El-Yacoubi, Moun\^im A. and Tevissen, Yannis},
  year={2026},
  url={https://arxiv.org/abs/2605.31029}
}
```