---
library_name: peek
license: cc-by-nc-sa-4.0
pipeline_tag: other
tags:
- video
- frame-selection
- frame-sampling
- video-captioning
- knowledge-distillation
- video-language-model
---

# PEEK: Picking Essential frames via Efficient Knowledge distillation

This repository hosts the **pretrained weights** for [**PEEK: Picking Essential frames via Efficient Knowledge distillation**](https://arxiv.org/abs/2605.31029), a query-free frame selector for low-budget video captioning.

- **Project Page:** https://www.killian-steunou.com/peek
- **Code (Apache-2.0):** https://github.com/momentslab/peek
- **Weights (this repo, CC-BY-NC-SA-4.0):** `peek_base.safetensors`

## What this is

PEEK receives only video frames at inference time. Given a budget of `k` frames it predicts per-frame relevance scores and returns the selected frames in temporal order, ready to be forwarded to a downstream Video-Language Model.

The released `peek_base` checkpoint is a 2-layer Transformer (~1.7M parameters) trained on **ActivityNet Captions** `train.json` by distilling a frozen, caption-conditioned SigLIP2 teacher into a query-free scorer over frozen [MobileCLIP2-S0](https://huggingface.co/apple/MobileCLIP2-S0) frame embeddings.

## Usage

Install the (Apache-licensed) code, then let it pull these weights automatically:

```bash
pip install git+https://github.com/momentslab/peek
```

```python
from pathlib import Path

from peek.inference import load_peek_pipeline, select_frames_from_video

# Downloads peek_base.safetensors from this repo on first use.
encoder, scorer, device = load_peek_pipeline(device="cuda")
output = select_frames_from_video(
    Path("video.mp4"),
    encoder=encoder,
    scorer=scorer,
    device=device,
    k=4,
    fps=2.0,
)
print(output.selected_indices, output.selected_timestamps_sec)
```

Or from the CLI:

```bash
python scripts/infer.py path/to/video.mp4 --k 4
```

## Files

| File | Description |
| --- | --- |
| `peek_base.safetensors` | PEEK scorer weights (model-only, ~7 MB). |
| `config.json` | Architecture and training metadata. |
| `LICENSE` | CC-BY-NC-SA-4.0 legal text. |

## Training details

- **Train data:** ActivityNet Captions `train.json` segments only.
- **Encoder:** MobileCLIP2-S0 (frozen), 512-d features per frame, via `open_clip`.
- **Teacher:** SigLIP2 SO400M patch14 384 (frozen).
- **Loss:** ListMLE on min-max normalized teacher cosine similarities.
- **Inference selection policy:** stratified argmax.

## Citation

```bibtex
@inproceedings{steunou2026peek,
  title={{PEEK}: Picking Essential frames via Efficient Knowledge distillation},
  author={Steunou, Killian and Filali Razzouki, Anas and Guetari, Khalil and El-Yacoubi, Moun\^im A. and Tevissen, Yannis},
  year={2026},
  url={https://arxiv.org/abs/2605.31029}
}
```
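
## Sketch: loss and selection policy

The loss and selection policy listed under Training details can be illustrated with a minimal NumPy sketch. It is not taken from the PEEK codebase. The function names `minmax_normalize`, `listmle_loss`, and `stratified_argmax` are hypothetical, the ListMLE form shown is the standard Plackett-Luce negative log-likelihood, and "stratified argmax" is assumed here to mean splitting the frames into `k` contiguous temporal bins and keeping the highest-scoring frame in each bin. The released implementation may differ in any of these details.

```python
import numpy as np


def minmax_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale teacher similarities to [0, 1] (eps guards against a constant input)."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min() + eps)


def listmle_loss(student_scores: np.ndarray, teacher_scores: np.ndarray) -> float:
    """Standard ListMLE: negative log-likelihood of the teacher's ranking
    under a Plackett-Luce model parameterised by the student scores."""
    order = np.argsort(-np.asarray(teacher_scores))  # teacher ranking, best first
    s = np.asarray(student_scores, dtype=np.float64)[order]
    # suffix_lse[i] = log(sum_{j >= i} exp(s[j])), accumulated stably from the end.
    suffix_lse = np.logaddexp.accumulate(s[::-1])[::-1]
    return float(np.sum(suffix_lse - s))


def stratified_argmax(scores: np.ndarray, k: int) -> np.ndarray:
    """Assumed policy: one argmax per contiguous temporal bin, k bins in total."""
    scores = np.asarray(scores)
    n = len(scores)
    k = min(k, n)  # cannot select more frames than exist
    # Integer bin edges split [0, n) into k non-empty, near-equal strata.
    edges = [(i * n) // k for i in range(k + 1)]
    picks = [lo + int(np.argmax(scores[lo:hi])) for lo, hi in zip(edges[:-1], edges[1:])]
    return np.array(picks)


# Toy example: 8 frames, budget of 4.
scores = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.4, 0.7, 0.05])
print(stratified_argmax(scores, k=4).tolist())  # [1, 3, 4, 6]
```

Under this assumed policy, taking one frame per bin yields indices that are already in temporal order, which matches the description of PEEK returning the selected frames in temporal order.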