---
library_name: peek
license: cc-by-nc-sa-4.0
pipeline_tag: other
tags:
- video
- frame-selection
- frame-sampling
- video-captioning
- knowledge-distillation
- video-language-model
---
# PEEK: Picking Essential frames via Efficient Knowledge distillation
This repository hosts the **pretrained weights** for [**PEEK: Picking Essential frames via Efficient Knowledge distillation**](https://arxiv.org/abs/2605.31029), a query-free frame selector for low-budget video captioning.
- **Project Page:** https://www.killian-steunou.com/peek
- **Code (Apache-2.0):** https://github.com/momentslab/peek
- **Weights (this repo, CC-BY-NC-SA-4.0):** `peek_base.safetensors`
## What this is
PEEK receives only video frames at inference time; no query or caption is needed. Given a budget of `k` frames, it predicts per-frame relevance scores and returns the selected frames in temporal order, ready to be forwarded to a downstream Video-Language Model.
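To make the selection step concrete, here is a minimal sketch of a stratified argmax over per-frame scores (the policy named under Training details). It assumes "stratified argmax" means splitting the timeline into `k` contiguous, roughly equal segments and keeping the highest-scoring frame in each. The function name `stratified_argmax` and the example scores are hypothetical and are not taken from the PEEK codebase.

```python
from typing import List, Sequence


def stratified_argmax(scores: Sequence[float], k: int) -> List[int]:
    """Pick one frame index per contiguous temporal stratum (assumed policy)."""
    n = len(scores)
    if k <= 0 or n == 0:
        return []
    k = min(k, n)  # cannot select more frames than exist
    selected = []
    for i in range(k):
        # Integer boundaries of the i-th stratum; non-empty because k <= n.
        start = i * n // k
        end = (i + 1) * n // k
        # Argmax inside the stratum; ties resolve to the earliest frame.
        best = max(range(start, end), key=lambda j: scores[j])
        selected.append(best)
    return selected  # already in temporal order, one index per stratum


# Hypothetical relevance scores for 8 sampled frames.
scores = [0.1, 0.9, 0.3, 0.2, 0.8, 0.4, 0.7, 0.05]
print(stratified_argmax(scores, k=4))  # -> [1, 2, 4, 6]
```

Because each stratum contributes exactly one frame, the output is sorted by construction and spread across the whole video, unlike a plain top-`k`, which can cluster in one high-scoring region.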
The released `peek_base` checkpoint is a 2-layer Transformer (~1.7M parameters)
trained on **ActivityNet Captions** `train.json` by distilling a frozen,
caption-conditioned SigLIP2 teacher into a query-free scorer over frozen
[MobileCLIP2-S0](https://huggingface.co/apple/MobileCLIP2-S0) frame embeddings.
## Usage
Install the (Apache-licensed) code, then let it pull these weights
automatically:
```bash
pip install git+https://github.com/momentslab/peek
```
```python
from pathlib import Path
from peek.inference import load_peek_pipeline, select_frames_from_video
# Downloads peek_base.safetensors from this repo on first use.
encoder, scorer, device = load_peek_pipeline(device="cuda")
output = select_frames_from_video(
    Path("video.mp4"),
    encoder=encoder, scorer=scorer, device=device,
    k=4, fps=2.0,
)
print(output.selected_indices, output.selected_timestamps_sec)
```
Or from the CLI:
```bash
python scripts/infer.py path/to/video.mp4 --k 4
```
## Files
| File | Description |
| --- | --- |
| `peek_base.safetensors` | PEEK scorer weights (model-only, ~7 MB). |
| `config.json` | Architecture and training metadata. |
| `LICENSE` | CC-BY-NC-SA-4.0 legal text. |
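As a quick sanity check on the checkpoint size, the arithmetic below relates the ~1.7M parameter count to the ~7 MB file. It assumes the weights are stored as 32-bit floats (4 bytes each) and ignores the small safetensors header; neither assumption is stated in this card.

```python
def checkpoint_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate on-disk size in decimal megabytes (assumes float32 by default)."""
    return num_params * bytes_per_param / 1e6


# ~1.7M parameters, taken from the description above.
print(round(checkpoint_size_mb(1_700_000), 1))  # -> 6.8
```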
## Training details
- **Training data:** ActivityNet Captions `train.json` segments only.
- **Encoder:** MobileCLIP2-S0 (frozen), 512-d features per frame, via `open_clip`.
- **Teacher:** SigLIP2 SO400M patch14 384 (frozen).
- **Loss:** ListMLE on min-max normalized teacher cosine similarities.
- **Inference selection policy:** stratified argmax.
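The loss listed above can be illustrated with a small pure-Python sketch of ListMLE. It follows the standard formulation: rank frames by the teacher targets, then take the negative log-likelihood of that ranking under a Plackett-Luce model of the student scores. The helper names, the `eps` constant, and the example values are hypothetical; the actual training code may differ. Note that in this plain form the targets enter only through their ordering, so min-max normalization does not change the value here.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Rescale values to roughly [0, 1]; eps guards against a constant input."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def listmle_loss(student_scores: Sequence[float],
                 teacher_scores: Sequence[float]) -> float:
    """ListMLE: negative log-likelihood of the teacher ranking under the student."""
    targets = min_max_normalize(teacher_scores)
    # Teacher-induced permutation, most relevant frame first.
    order = sorted(range(len(targets)), key=lambda i: targets[i], reverse=True)
    ranked = [student_scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_sum_exp - ranked[i]
    return loss


# Hypothetical teacher cosine similarities for three frames.
teacher = [0.2, 0.9, 0.5]
print(listmle_loss([0.0, 2.0, 1.0], teacher))  # student agrees with the teacher ranking
print(listmle_loss([2.0, 0.0, 1.0], teacher))  # student disagrees: larger loss
```

A student whose scores reproduce the teacher's ordering gets a lower loss than one that inverts it, which is the property the distillation relies on.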
## Citation
```bibtex
@inproceedings{steunou2026peek,
title={{PEEK}: Picking Essential frames via Efficient Knowledge distillation},
author={Steunou, Killian and Filali Razzouki, Anas and Guetari, Khalil and El-Yacoubi, Moun\^im A. and Tevissen, Yannis},
year={2026},
url={https://arxiv.org/abs/2605.31029}
}
```