Jina AI committed
Commit 2b230c9 · 0 Parent(s)

Initial public release

.gitattributes ADDED
@@ -0,0 +1,37 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ architecture.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,260 @@
---
pipeline_tag: sentence-similarity
tags:
- embedding
- jina-embeddings-v5
- feature-extraction
- sentence-transformers
- multimodal
- vision
- audio
- vllm
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---

### **jina-embeddings-v5-omni-nano**: Multi-Task Omni Embedding Base (Nano)

![Architecture](architecture.png)

### Model Overview

`jina-embeddings-v5-omni-nano` is a multimodal embedding model that accepts **text, images, video, and audio** and produces embeddings in a shared vector space aligned with the text-only [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) — so you can index with text and query with any modality, no reindexing.

This is the **base** repository — it holds all task adapters (retrieval, classification, clustering, text-matching). For a single task, pre-merged task-specific variants are also available:
- [`jinaai/jina-embeddings-v5-omni-nano-classification`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-classification)
- [`jinaai/jina-embeddings-v5-omni-nano-clustering`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-clustering)
- [`jinaai/jina-embeddings-v5-omni-nano-retrieval`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-retrieval)
- [`jinaai/jina-embeddings-v5-omni-nano-text-matching`](https://huggingface.co/jinaai/jina-embeddings-v5-omni-nano-text-matching)

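
If only one task is needed, a pre-merged variant loads like any other checkpoint — a minimal sketch, assuming the variant repositories expose the same `AutoModel` remote-code interface as the base (no task selection required, since the adapter is already merged):

```python
from transformers import AutoModel

# Retrieval-only deployment via the pre-merged variant (assumed interface).
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v5-omni-nano-retrieval",
    trust_remote_code=True,
).eval()
```
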
| Feature | Value |
| --- | --- |
| Parameters | ~1.04B |
| Embedding Dimension | 768 |
| Supported Tasks | `retrieval`, `classification`, `clustering`, `text-matching` |
| Max Sequence Length | 8192 |
| Pooling Strategy | Last-token |
| Supported Inputs | text, image, video, audio |
| Supported File Types | images: `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff`, `.avif`, `.heic`, `.svg`; video: `.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`, `.flv`, `.wmv`; audio: `.wav`, `.mp3`, `.flac`, `.ogg`, `.m4a`, `.opus`; documents: `.pdf` |

### Install

```bash
# core
pip install transformers torch pillow numpy

# Optional — install only the extras for the modalities you actually use:
pip install librosa soundfile       # audio decoding
pip install av imageio              # video decoding (pure-Python, no codec daemon)
pip install pdf2image pypdfium2     # PDF rendering
pip install cairosvg pillow         # SVG rendering
pip install "vllm==0.20.1"          # high-throughput serving (validated)
pip install sentence-transformers   # one-call multimodal API
```

For minimum versions, see the Requirements section below (`transformers>=4.57`, `torch>=2.5`; the vLLM path is validated with `vllm==0.20.1`).

### Quickstart

```python
from PIL import Image
import librosa, torch
from transformers import AutoModel, AutoProcessor, WhisperFeatureExtractor

repo = "jinaai/jina-embeddings-v5-omni-nano"
model = AutoModel.from_pretrained(repo, trust_remote_code=True, default_task="retrieval").eval()
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

# model.embed(**inputs) returns L2-normalized last-token embeddings.
t_vec = model.embed(**proc(text="Query: Which planet is known as the Red Planet?", return_tensors="pt").to(model.device))
i_vec = model.embed(**proc(images=Image.open("photo.jpg"), text="<image>", return_tensors="pt").to(model.device))
v_vec = model.embed(**proc(videos="clip.mp4", text="<image>", return_tensors="pt").to(model.device))

# Audio has no string placeholder — build token ids from config.
audio, _ = librosa.load("speech.wav", sr=16000)
feat = WhisperFeatureExtractor(feature_size=128)(audio, sampling_rate=16000, return_tensors="pt")["input_features"]
cfg = model.config
n = feat.shape[-1] // 4
ids = torch.tensor([[cfg.audio_start_token_id, *[cfg.audio_token_id]*n, cfg.audio_end_token_id]])
a_vec = model.embed(
    input_ids=ids.to(model.device),
    attention_mask=torch.ones_like(ids).to(model.device),
    input_features=feat.to(model.device, dtype=next(model.parameters()).dtype),
)
```

For retrieval, queries and documents use different prompts. With raw `transformers`, prefix the text yourself (`"Query: "` / `"Document: "`, as in the example above); with `sentence-transformers`, use `encode_query()` for query-side embeddings and `encode_document()` for document-side embeddings — a bare `encode(text)` call does not know which retrieval side you intended.

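
For the document side with raw `transformers`, the sketch below reuses the quickstart objects and the `"Document: "` prefix defined in `config_sentence_transformers.json`:

```python
# Document-side embedding: same embed() call, document prefix instead of "Query: ".
d_vec = model.embed(**proc(
    text="Document: Mars is often referred to as the Red Planet due to its reddish appearance.",
    return_tensors="pt",
).to(model.device))
```
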
No `dtype`, `device`, `min_pixels`, or custom pooling code needed — sensible defaults live in the model config (bf16 weights, 256–1280 vision tokens).

<details>
<summary>Requirements</summary>

- `transformers>=4.57` (recommend >=5.1 for the small variants)
- `torch>=2.5`

Optional:
- `sentence-transformers` — one-call API for all 4 modalities
- `librosa` — audio decoding
- `av` — video decoding (`pip install av`)
- `vllm==0.20.1` — high-throughput serving; H100 deployments may also need DeepGEMM installed for vLLM FP8 kernels

</details>

### Selective Modality Loading

By default all components (vision + audio towers + text encoder) are loaded.
To save memory, pick a subset — the unused towers are skipped at load time:

```python
from transformers import AutoModel

AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="omni")    # all (default)
AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="vision")  # vision + text
AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="audio")   # audio + text
AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, modality="text")    # text only
```

Same parameter works via `sentence-transformers`:

```python
SentenceTransformer("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True, model_kwargs={"modality": "vision"})
```

### Via sentence-transformers

```python
from sentence_transformers import SentenceTransformer

# Base repo holds all 4 task adapters — pick one at load time.
model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-nano",
    trust_remote_code=True,
    model_kwargs={"default_task": "retrieval"},
)

# URLs, local paths (with or without extension), PIL.Image, np.ndarray,
# torch.Tensor, bytes, and BytesIO are all accepted directly.
q_vec = model.encode_query("Which planet is known as the Red Planet?")
d_vec = model.encode_document("Mars is often referred to as the Red Planet due to its reddish appearance.")
i_vec = model.encode("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg")
v_vec = model.encode("https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/sample_demo_1.mp4")  # needs `pip install av`
a_vec = model.encode("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")  # needs `pip install librosa soundfile`

# Fused multimodal — a tuple becomes ONE embedding in a single forward pass:
emb = model.encode(("Winter boots, waterproof leather upper",
                    "https://.../boot.jpg",
                    "https://.../boot.mp4"))
```

For non-retrieval tasks (classification / clustering / text-matching), reload with the corresponding `default_task` — no `prompt_name` needed.
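
A minimal sketch for text-matching, assuming the same loading pattern as above (cosine is the configured similarity function):

```python
sim_model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-nano",
    trust_remote_code=True,
    model_kwargs={"default_task": "text-matching"},
)
a = sim_model.encode("A cat sits on the mat.")
b = sim_model.encode("A kitten is resting on a rug.")
print(sim_model.similarity(a, b))  # symmetric task — no query/document prompts needed
```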

No `dtype`, `device`, `min_pixels`, or custom pooling code needed — sensible defaults live in the model config (bf16 weights, 256–1280 vision tokens).

<!-- VIDEO_INPUT_TYPES_DETAILS -->
<details><summary>Accepted video inputs</summary>

Path (`.mp4 .avi .mov .mkv .webm .flv .wmv`, or extensionless — content-sniffed), HTTP(S) URL, `bytes`/`io.BytesIO`, and in-memory `np.ndarray` / `torch.Tensor` of shape `(T, H, W, 3|4)` with dtype `uint8`. In-memory arrays are encoded to MP4 on the fly (needs `pip install imageio imageio-ffmpeg`).

```python
import numpy as np
# (T, H, W, 3) uint8 — e.g. from decord, imageio, or an rgb frame buffer
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)
v_vec = model.encode(frames)
```

</details>

### Via vLLM

The base repo holds all 4 task adapters. Pick **one task per vLLM instance** via `hf_overrides`:

```python
from vllm import LLM
llm = LLM(
    model="jinaai/jina-embeddings-v5-omni-nano",
    runner="pooling",
    trust_remote_code=True,
    hf_overrides={"task": "retrieval"},  # or: classification / clustering / text-matching
)
outs = llm.embed([{"prompt": "Which planet is known as the Red Planet?"}])
```

Or via CLI:

```bash
vllm serve jinaai/jina-embeddings-v5-omni-nano \
  --trust-remote-code \
  --hf-overrides '{"task": "retrieval"}'
```

Alternatively set `JINA_V5_TASK=retrieval` in the environment. Output is bit-exact with the corresponding pre-merged `-retrieval` / `-classification` / `-clustering` / `-text-matching` variant.
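
In Python, the environment-variable route looks like this — a sketch, assuming `JINA_V5_TASK` is read when the model is constructed:

```python
import os
from vllm import LLM

os.environ["JINA_V5_TASK"] = "retrieval"  # set before the model is constructed
llm = LLM(model="jinaai/jina-embeddings-v5-omni-nano", runner="pooling", trust_remote_code=True)
```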

### Matryoshka (truncating embeddings)

All three backends support truncating the full embedding to a shorter dimension
with L2 re-normalization, so the result stays unit-norm:

```python
# transformers
vec = model.embed(truncate_dim=256, **proc(text="hello", return_tensors="pt"))
# or
vec = model.encode(["hello"], task="retrieval", truncate_dim=256)

# sentence-transformers
vec = model.encode("hello", truncate_dim=256)

# vLLM — ask the pooler for a smaller embedding
from vllm import PoolingParams
outs = llm.embed(prompts, pooling_params=PoolingParams(dimensions=256))
# or truncate + renormalize the full-dim output yourself:
import numpy as np
full = np.asarray(outs[0].outputs.embedding)
vec = full[:256] / np.linalg.norm(full[:256])
```

<!-- BATCHING_SECTION_START -->
### Batching

Pass a list to encode many inputs in one call.

```python
# sentence-transformers — any modality
t_vecs = model.encode(["query 1", "query 2"])
i_vecs = model.encode([Image.open("a.jpg"), Image.open("b.jpg")])
v_vecs = model.encode(["clip1.mp4", "clip2.mp4"])
a_vecs = model.encode(["speech1.wav", "speech2.wav"])

# raw transformers — text (native padded batch)
inputs = proc(text=["query 1", "query 2"], padding=True, truncation=True, return_tensors="pt").to(model.device)
vecs = model.embed(**inputs)  # (2, dim)

# vLLM — list of request dicts, any modality
outs = llm.embed([
    {"prompt": "query 1"},
    {"prompt": "query 2"},
])
```

For `sentence-transformers`, images / video / audio are forwarded per-sample (one forward pass each). Text is truly batched. For large-scale multimodal throughput, prefer `vLLM`.

<!-- BATCHING_SECTION_END -->

### Compatibility

Embeddings produced by this model share a vector space with:
- [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) — text-only (alignment via the matching adapter)

You can index text with the `v5-text-nano` model and query it with image,
video, or audio embeddings from `jina-embeddings-v5-omni-nano` — no reindexing.
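
A minimal cross-model sketch, assuming `jina-embeddings-v5-text-nano` exposes the same sentence-transformers interface with a retrieval adapter (the image URL below is a placeholder):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("jinaai/jina-embeddings-v5-text-nano", trust_remote_code=True,
                                 model_kwargs={"default_task": "retrieval"})
omni_model = SentenceTransformer("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True,
                                 model_kwargs={"default_task": "retrieval"})

# Index with the text-only model, query the same index with an image.
doc_vec = text_model.encode_document("Mars is often referred to as the Red Planet.")
img_vec = omni_model.encode("https://example.com/mars.jpg")  # placeholder URL
print(float(np.dot(img_vec, doc_vec)))  # both unit-norm, so the dot product is cosine similarity
```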

### License

CC BY-NC 4.0. For commercial use, [contact us](mailto:sales@jina.ai).
adapters/classification/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "jinaai/jina-embeddings-v5-omni-nano",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": "gaussian",
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 32,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 32,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "o_proj",
28
+ "k_proj",
29
+ "q_proj",
30
+ "down_proj",
31
+ "gate_proj",
32
+ "v_proj",
33
+ "up_proj"
34
+ ],
35
+ "task_type": "FEATURE_EXTRACTION",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
adapters/classification/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:13c8389381221af49bbe1d40231f50b28e354c901af57e6b1a1b3a6ec34f42b2
3
+ size 13589512
adapters/clustering/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "jinaai/jina-embeddings-v5-omni-nano",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": "gaussian",
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 32,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.1,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 32,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "q_proj",
28
+ "down_proj",
29
+ "gate_proj",
30
+ "v_proj",
31
+ "o_proj",
32
+ "k_proj",
33
+ "up_proj"
34
+ ],
35
+ "task_type": "FEATURE_EXTRACTION",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
adapters/clustering/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:030822246d646a74a9886410eddcc80c663384bcaeacd31a869a523f35268c5f
3
+ size 13589512
adapters/retrieval/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "jinaai/jina-embeddings-v5-omni-nano",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": "gaussian",
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 32,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.1,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 32,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "gate_proj",
28
+ "v_proj",
29
+ "q_proj",
30
+ "down_proj",
31
+ "o_proj",
32
+ "k_proj",
33
+ "up_proj"
34
+ ],
35
+ "task_type": "FEATURE_EXTRACTION",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
adapters/retrieval/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9c2b4bd34101afd04833c5626958724e7587c292ffc6564788cfa10af89a2157
3
+ size 13589512
adapters/text-matching/adapter_config.json ADDED
@@ -0,0 +1,39 @@
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "jinaai/jina-embeddings-v5-omni-nano",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": "gaussian",
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 32,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.1,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "r": 32,
24
+ "rank_pattern": {},
25
+ "revision": null,
26
+ "target_modules": [
27
+ "k_proj",
28
+ "v_proj",
29
+ "o_proj",
30
+ "down_proj",
31
+ "q_proj",
32
+ "gate_proj",
33
+ "up_proj"
34
+ ],
35
+ "task_type": "FEATURE_EXTRACTION",
36
+ "trainable_token_indices": null,
37
+ "use_dora": false,
38
+ "use_rslora": false
39
+ }
adapters/text-matching/adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:24bb801dbc7e565e9a64e12ef0049fc2631c381f50a215a83cf90fe0ccba2e0e
3
+ size 13589512
architecture.png ADDED

Git LFS Details

  • SHA256: d19de1304bc3b370b3c5af213dd205ccfe42177b38888feb966605556ee6b721
  • Pointer size: 132 Bytes
  • Size of remote file: 1.24 MB
chat_template.jinja ADDED
@@ -0,0 +1,154 @@
1
+ {%- set image_count = namespace(value=0) %}
2
+ {%- set video_count = namespace(value=0) %}
3
+ {%- macro render_content(content, do_vision_count, is_system_content=false) %}
4
+ {%- if content is string %}
5
+ {{- content }}
6
+ {%- elif content is iterable and content is not mapping %}
7
+ {%- for item in content %}
8
+ {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
9
+ {%- if is_system_content %}
10
+ {{- raise_exception('System message cannot contain images.') }}
11
+ {%- endif %}
12
+ {%- if do_vision_count %}
13
+ {%- set image_count.value = image_count.value + 1 %}
14
+ {%- endif %}
15
+ {%- if add_vision_id %}
16
+ {{- 'Picture ' ~ image_count.value ~ ': ' }}
17
+ {%- endif %}
18
+ {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
19
+ {%- elif 'video' in item or item.type == 'video' %}
20
+ {%- if is_system_content %}
21
+ {{- raise_exception('System message cannot contain videos.') }}
22
+ {%- endif %}
23
+ {%- if do_vision_count %}
24
+ {%- set video_count.value = video_count.value + 1 %}
25
+ {%- endif %}
26
+ {%- if add_vision_id %}
27
+ {{- 'Video ' ~ video_count.value ~ ': ' }}
28
+ {%- endif %}
29
+ {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
30
+ {%- elif 'text' in item %}
31
+ {{- item.text }}
32
+ {%- else %}
33
+ {{- raise_exception('Unexpected item type in content.') }}
34
+ {%- endif %}
35
+ {%- endfor %}
36
+ {%- elif content is none or content is undefined %}
37
+ {{- '' }}
38
+ {%- else %}
39
+ {{- raise_exception('Unexpected content type.') }}
40
+ {%- endif %}
41
+ {%- endmacro %}
42
+ {%- if not messages %}
43
+ {{- raise_exception('No messages provided.') }}
44
+ {%- endif %}
45
+ {%- if tools and tools is iterable and tools is not mapping %}
46
+ {{- '<|im_start|>system\n' }}
47
+ {{- "# Tools\n\nYou have access to the following functions:\n\n<tools>" }}
48
+ {%- for tool in tools %}
49
+ {{- "\n" }}
50
+ {{- tool | tojson }}
51
+ {%- endfor %}
52
+ {{- "\n</tools>" }}
53
+ {{- '\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags\n- Required parameters MUST be specified\n- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after\n- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls\n</IMPORTANT>' }}
54
+ {%- if messages[0].role == 'system' %}
55
+ {%- set content = render_content(messages[0].content, false, true)|trim %}
56
+ {%- if content %}
57
+ {{- '\n\n' + content }}
58
+ {%- endif %}
59
+ {%- endif %}
60
+ {{- '<|im_end|>\n' }}
61
+ {%- else %}
62
+ {%- if messages[0].role == 'system' %}
63
+ {%- set content = render_content(messages[0].content, false, true)|trim %}
64
+ {{- '<|im_start|>system\n' + content + '<|im_end|>\n' }}
65
+ {%- endif %}
66
+ {%- endif %}
67
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
68
+ {%- for message in messages[::-1] %}
69
+ {%- set index = (messages|length - 1) - loop.index0 %}
70
+ {%- if ns.multi_step_tool and message.role == "user" %}
71
+ {%- set content = render_content(message.content, false)|trim %}
72
+ {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
73
+ {%- set ns.multi_step_tool = false %}
74
+ {%- set ns.last_query_index = index %}
75
+ {%- endif %}
76
+ {%- endif %}
77
+ {%- endfor %}
78
+ {%- if ns.multi_step_tool %}
79
+ {{- raise_exception('No user query found in messages.') }}
80
+ {%- endif %}
81
+ {%- for message in messages %}
82
+ {%- set content = render_content(message.content, true)|trim %}
83
+ {%- if message.role == "system" %}
84
+ {%- if not loop.first %}
85
+ {{- raise_exception('System message must be at the beginning.') }}
86
+ {%- endif %}
87
+ {%- elif message.role == "user" %}
88
+ {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
89
+ {%- elif message.role == "assistant" %}
90
+ {%- set reasoning_content = '' %}
91
+ {%- if message.reasoning_content is string %}
92
+ {%- set reasoning_content = message.reasoning_content %}
93
+ {%- else %}
94
+ {%- if '</think>' in content %}
95
+ {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
96
+ {%- set content = content.split('</think>')[-1].lstrip('\n') %}
97
+ {%- endif %}
98
+ {%- endif %}
99
+ {%- set reasoning_content = reasoning_content|trim %}
100
+ {%- if loop.index0 > ns.last_query_index %}
101
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
102
+ {%- else %}
103
+ {{- '<|im_start|>' + message.role + '\n' + content }}
104
+ {%- endif %}
105
+ {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
106
+ {%- for tool_call in message.tool_calls %}
107
+ {%- if tool_call.function is defined %}
108
+ {%- set tool_call = tool_call.function %}
109
+ {%- endif %}
110
+ {%- if loop.first %}
111
+ {%- if content|trim %}
112
+ {{- '\n\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
113
+ {%- else %}
114
+ {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
115
+ {%- endif %}
116
+ {%- else %}
117
+ {{- '\n<tool_call>\n<function=' + tool_call.name + '>\n' }}
118
+ {%- endif %}
119
+ {%- if tool_call.arguments is defined %}
120
+ {%- for args_name, args_value in tool_call.arguments|items %}
121
+ {{- '<parameter=' + args_name + '>\n' }}
122
+ {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
123
+ {{- args_value }}
124
+ {{- '\n</parameter>\n' }}
125
+ {%- endfor %}
126
+ {%- endif %}
127
+ {{- '</function>\n</tool_call>' }}
128
+ {%- endfor %}
129
+ {%- endif %}
130
+ {{- '<|im_end|>\n' }}
131
+ {%- elif message.role == "tool" %}
132
+ {%- if loop.previtem and loop.previtem.role != "tool" %}
133
+ {{- '<|im_start|>user' }}
134
+ {%- endif %}
135
+ {{- '\n<tool_response>\n' }}
136
+ {{- content }}
137
+ {{- '\n</tool_response>' }}
138
+ {%- if not loop.last and loop.nextitem.role != "tool" %}
139
+ {{- '<|im_end|>\n' }}
140
+ {%- elif loop.last %}
141
+ {{- '<|im_end|>\n' }}
142
+ {%- endif %}
143
+ {%- else %}
144
+ {{- raise_exception('Unexpected message role.') }}
145
+ {%- endif %}
146
+ {%- endfor %}
147
+ {%- if add_generation_prompt %}
148
+ {{- '<|im_start|>assistant\n' }}
149
+ {%- if enable_thinking is defined and enable_thinking is true %}
150
+ {{- '<think>\n' }}
151
+ {%- else %}
152
+ {{- '<think>\n\n</think>\n\n' }}
153
+ {%- endif %}
154
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,101 @@
1
+ {
2
+ "architectures": [
3
+ "JinaEmbeddingsV5OmniModel"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "modeling_jina_embeddings_v5_omni.JinaEmbeddingsV5OmniConfig",
7
+ "AutoModel": "modeling_jina_embeddings_v5_omni.JinaEmbeddingsV5OmniModel"
8
+ },
9
+ "model_type": "jina_embeddings_v5_omni",
10
+ "task_names": [
11
+ "retrieval",
12
+ "text-matching",
13
+ "clustering",
14
+ "classification"
15
+ ],
16
+ "special_token_ids": [
17
+ 128256,
18
+ 128257,
19
+ 128258,
20
+ 128259
21
+ ],
22
+ "vision_config": {
23
+ "deepstack_visual_indexes": [],
24
+ "depth": 12,
25
+ "dtype": "bfloat16",
26
+ "hidden_act": "gelu_pytorch_tanh",
27
+ "hidden_size": 768,
28
+ "in_channels": 3,
29
+ "initializer_range": 0.02,
30
+ "intermediate_size": 3072,
31
+ "model_type": "",
32
+ "num_heads": 12,
33
+ "num_position_embeddings": 2304,
34
+ "out_hidden_size": 1024,
35
+ "patch_size": 16,
36
+ "spatial_merge_size": 2,
37
+ "temporal_patch_size": 2
38
+ },
39
+ "text_config": {
40
+ "attention_bias": false,
41
+ "attention_dropout": 0.0,
42
+ "bos_token_id": 1,
43
+ "eos_token_id": 2,
44
+ "head_dim": 64,
45
+ "hidden_act": "silu",
46
+ "hidden_size": 768,
47
+ "initializer_range": 0.02,
48
+ "intermediate_size": 3072,
49
+ "is_causal": false,
50
+ "max_position_embeddings": 8192,
51
+ "mlp_bias": false,
52
+ "model_type": "",
53
+ "num_attention_heads": 12,
54
+ "num_hidden_layers": 12,
55
+ "num_key_value_heads": 12,
56
+ "pad_token_id": null,
57
+ "pretraining_tp": 1,
58
+ "rms_norm_eps": 1e-05,
59
+ "rope_parameters": {
60
+ "rope_theta": 1000000.0,
61
+ "rope_type": "default"
62
+ },
63
+ "tie_word_embeddings": false,
64
+ "vocab_size": 128260
65
+ },
66
+ "audio_config": {
67
+ "activation_dropout": 0.0,
68
+ "activation_function": "gelu",
69
+ "attention_dropout": 0.0,
70
+ "d_model": 1280,
71
+ "dropout": 0.0,
72
+ "dtype": "float32",
73
+ "encoder_attention_heads": 20,
74
+ "encoder_ffn_dim": 5120,
75
+ "encoder_layers": 32,
76
+ "initializer_range": 0.02,
77
+ "max_source_positions": 1500,
78
+ "num_mel_bins": 128,
79
+ "scale_embedding": false,
80
+ "n_window": 100,
81
+ "output_dim": 3584
82
+ },
83
+ "image_token_index": 128259,
84
+ "audio_token_id": 128256,
85
+ "audio_start_token_id": 128257,
86
+ "audio_end_token_id": 128258,
87
+ "projector_hidden_act": "gelu",
88
+ "tie_word_embeddings": false,
89
+ "dtype": "bfloat16",
90
+ "transformers_version": "5.4.0",
91
+ "torch_dtype": "bfloat16",
92
+ "is_matryoshka": true,
93
+ "matryoshka_dimensions": [
94
+ 32,
95
+ 64,
96
+ 128,
97
+ 256,
98
+ 512,
99
+ 768
100
+ ]
101
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,8 @@
1
+ {
2
+ "prompts": {
3
+ "query": "Query: ",
4
+ "document": "Document: "
5
+ },
6
+ "default_prompt_name": null,
7
+ "similarity_fn_name": "cosine"
8
+ }
custom_st.py ADDED
@@ -0,0 +1,990 @@
1
+ """Sentence-transformers integration for jina-embeddings-v5-omni-nano (base + LoRA).
2
+
3
+ Supports text, image, video, and audio with per-task adapter routing:
4
+
5
+ from sentence_transformers import SentenceTransformer
6
+ model = SentenceTransformer(
7
+ "jinaai/jina-embeddings-v5-omni-nano",
8
+ trust_remote_code=True,
9
+ model_kwargs={"default_task": "retrieval"},
10
+ )
11
+ q = model.encode("What is ML?", prompt_name="query")
12
+ d = model.encode("ML is ...", prompt_name="document")
13
+ img = model.encode(Image.open("photo.jpg"))
14
+ vid = model.encode("clip.mp4")
15
+ aud = model.encode("speech.wav")
16
+ """
17
+
18
+ import json
19
+ import os
20
+ from typing import Any, Dict, List, Optional, Union
21
+
22
+ import torch
23
+ import torch.nn.functional as F
24
+ from torch import nn
25
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
26
+
27
+ MAX_SEQ_LENGTH = 8192
28
+ IMAGE_PROMPT = "<image>"
29
+ VIDEO_PROMPT = "<image>"
30
+ AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a", ".opus", ".webm"}
31
+ VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv", ".wmv"}
32
+ PDF_EXTENSIONS = {".pdf"}
33
+ SVG_EXTENSIONS = {".svg"}
34
+ PDF_DPI = 150
35
+ TASK_NAMES = ["retrieval", "text-matching", "clustering", "classification"]
36
+ EVAL_IMAGE_MIN_PIXELS = 262144
37
+ EVAL_IMAGE_MAX_PIXELS = 1310720
38
+ EVAL_VIDEO_MAX_PIXELS = 12845056
39
+ EVAL_VIDEO_NUM_FRAMES = 32
40
+
41
+
42
+ def _pil_image():
43
+ """Return the PIL.Image module, with a clean ImportError if pillow is not
44
+ installed. Wrapped in `try` so transformers' AST-based `check_imports`
45
+ does not list PIL as a top-level required dependency: text-only and
46
+ audio-only users should not need pillow installed.
47
+ """
48
+ try:
49
+ from PIL import Image as _PILImage
50
+ except ImportError as e:
51
+ raise ImportError(
52
+ "Encoding images or rasterising PDFs needs `pip install pillow`."
53
+ ) from e
54
+ return _PILImage
55
+
56
+
57
+ def _is_image(x) -> bool:
58
+ try:
59
+ from PIL import Image as PILImage
60
+ return isinstance(x, PILImage.Image)
61
+ except ImportError:
62
+ return False
63
+
64
+
65
+ def _is_video_path(x) -> bool:
66
+ if not isinstance(x, str):
67
+ return False
68
+ return any(x.lower().endswith(ext) for ext in VIDEO_EXTENSIONS)
69
+
70
+
71
+ def _is_audio_path(x) -> bool:
72
+ if not isinstance(x, str):
73
+ return False
74
+ return any(x.lower().endswith(ext) for ext in AUDIO_EXTENSIONS)
75
+
76
+
77
+ def _is_pdf_path(x) -> bool:
78
+ if not isinstance(x, str):
79
+ return False
80
+ return any(x.lower().endswith(ext) for ext in PDF_EXTENSIONS)
81
+
82
+
83
+ def _is_svg_path(x) -> bool:
84
+ if not isinstance(x, str):
85
+ return False
86
+ return any(x.lower().split("?", 1)[0].endswith(ext) for ext in SVG_EXTENSIONS)
87
+
88
+
89
+ def _is_audio_array(x) -> bool:
90
+ try:
91
+ import numpy as np
92
+ except ImportError:
93
+ return False
94
+ return isinstance(x, np.ndarray) and x.ndim == 1 and np.issubdtype(x.dtype, np.floating)
95
+
96
+
97
+ class _AudioWrapper:
98
+ def __init__(self, array, sampling_rate: int = 16000):
99
+ self.array = array
100
+ self.sampling_rate = sampling_rate
101
+
102
+
103
+ def _download_if_url(x):
104
+ """If x is an http(s) URL, download to a hashed local cache and return the
105
+ local path. Otherwise return x unchanged.
106
+ """
107
+ if not isinstance(x, str):
108
+ return x
109
+ if not (x.startswith("http://") or x.startswith("https://")):
110
+ return x
111
+ import hashlib, os, tempfile, urllib.request
112
+ from urllib.parse import urlparse
113
+ cache = os.path.join(tempfile.gettempdir(), "jina_omni_media_cache")
114
+ os.makedirs(cache, exist_ok=True)
115
+ h = hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]
116
+ url_path = urlparse(x).path
117
+ _, ext = os.path.splitext(url_path)
118
+ local = os.path.join(cache, f"{h}{ext}" if ext else h)
119
+ if not os.path.isfile(local) or os.path.getsize(local) == 0:
120
+ urllib.request.urlretrieve(x, local)
121
+ return local
122
+
123
+
124
+ def _looks_like_svg(data):
125
+ if not data:
126
+ return False
127
+ head = data[:4096].lstrip().lower()
128
+ return b"<svg" in head
129
+
130
+
131
+ def _svg_to_image(svg):
132
+ try:
133
+ import cairosvg
134
+ except ImportError as e:
135
+ raise ImportError("Encoding SVG images needs `pip install cairosvg pillow`.") from e
136
+ import io
137
+ png = cairosvg.svg2png(bytestring=svg if isinstance(svg, (bytes, bytearray)) else None,
138
+ url=svg if isinstance(svg, str) else None)
139
+ _PILImage = _pil_image()
140
+ return _PILImage.open(io.BytesIO(png)).convert("RGB")
141
+
142
+
143
+ def _sniff_media_type_bytes(head):
144
+ """Return 'image'/'svg'/'video'/'audio'/'pdf'/None from content headers."""
145
+ if _looks_like_svg(head):
146
+ return "svg"
147
+ if not head or len(head) < 8:
148
+ return None
149
+ if head[:3] == b"\xff\xd8\xff": return "image"
150
+ if head[:8] == b"\x89PNG\r\n\x1a\n": return "image"
151
+ if head[:6] in (b"GIF87a", b"GIF89a"): return "image"
152
+ if head[:4] == b"RIFF" and head[8:12] == b"WEBP": return "image"
153
+ if head[:2] == b"BM": return "image"
154
+ if head[:4] in (b"II*\x00", b"MM\x00*"): return "image"
155
+ if head[4:12] in (b"ftypavif", b"ftypavis"): return "image"
156
+ if head[4:12] in (b"ftypheic", b"ftypheix", b"ftypmif1", b"ftypmsf1"):
157
+ return "image"
158
+ if head[:3] == b"ID3": return "audio"
159
+ if head[:2] in (b"\xff\xfb", b"\xff\xf3", b"\xff\xf2"): return "audio"
160
+ if head[:4] == b"fLaC": return "audio"
161
+ if head[:4] == b"OggS": return "audio"
162
+ if head[:4] == b"RIFF" and head[8:12] == b"WAVE": return "audio"
163
+ if head[4:12] in (b"M4A ", b"M4B ", b"M4P "): return "audio"
164
+ if head[:4] == b"\x1a\x45\xdf\xa3": return "video"
165
+ if head[4:8] == b"ftyp": return "video"
166
+ if head[:4] == b"RIFF" and head[8:12] == b"AVI ": return "video"
167
+ if head[:3] == b"FLV": return "video"
168
+ if head[:4] == b"0&\xb2u": return "video"
169
+ if head[:5] == b"%PDF-": return "pdf"
170
+ return None
171
+
172
+
173
+ def _sniff_media_type(path):
174
+ try:
175
+ with open(path, "rb") as f:
176
+ data = f.read(4096)
177
+ kind = _sniff_media_type_bytes(data)
178
+ if kind is None and _is_svg_path(path):
179
+ return "svg"
180
+ return kind
181
+ except OSError:
182
+ return None
183
+
184
+
185
+ def _resolve_input(x):
186
+ """Normalize any input to (kind, value). Accepts:
187
+ - PIL.Image -> image
188
+ - np.ndarray HxWx3 uint8 -> image (via PIL.fromarray)
189
+ - np.ndarray TxHxWx3 uint8 -> video (frames passed straight to the processor)
190
+ - np.ndarray 1-D float -> audio
191
+ - np.ndarray 2-D float (C,N) or (N,C) -> audio (mono mixdown)
192
+ - torch.Tensor -> converted to numpy, recurse
193
+ - bytes / io.IOBase -> sniff + route
194
+ - str URL -> downloaded + routed
195
+ - str path -> content-sniffed + routed
196
+ - str -> text
197
+ """
198
+ import os as _os
199
+ import io
200
+
201
+ if _is_image(x):
202
+ return ("image", x)
203
+ if _is_audio_array(x):
204
+ return ("audio", x)
205
+
206
+ try:
207
+ import numpy as _np
208
+ except ImportError:
209
+ _np = None
210
+
211
+ if _np is not None and isinstance(x, _np.ndarray):
212
+ # Image (H,W,3|4) uint8
213
+ if x.ndim == 3 and x.shape[-1] in (3, 4) and x.dtype == _np.uint8:
214
+ _PILImage = _pil_image()
215
+ mode = "RGBA" if x.shape[-1] == 4 else "RGB"
216
+ return ("image", _PILImage.fromarray(x, mode).convert("RGB"))
217
+ # Video (T,H,W,3|4) uint8
218
+ if x.ndim == 4 and x.shape[-1] in (3, 4) and x.dtype == _np.uint8:
219
+ # Pass frames straight to the processor — no mp4 round-trip, no
220
+ # av/imageio needed. Drop alpha if present.
221
+ return ("video", x if x.shape[-1] == 3 else x[..., :3])
222
+ # Audio multichannel 2D float -> mono mixdown
223
+ if x.ndim == 2 and _np.issubdtype(x.dtype, _np.floating):
224
+ audio = x.mean(axis=0 if x.shape[0] <= 8 else 1).astype(_np.float32)
225
+ return ("audio", audio)
226
+
227
+ # torch.Tensor -> numpy and recurse
228
+ try:
229
+ import torch as _torch
230
+ except ImportError:
231
+ _torch = None
232
+ if _torch is not None and isinstance(x, _torch.Tensor):
233
+ return _resolve_input(x.detach().cpu().numpy())
234
+
235
+ # bytes / BytesIO / file-like
236
+ if isinstance(x, (bytes, bytearray)):
237
+ data = bytes(x)
238
+ elif isinstance(x, io.IOBase):
239
+ data = x.read()
240
+ else:
241
+ data = None
242
+
243
+ if data is not None:
244
+ kind = _sniff_media_type_bytes(data[:4096])
245
+ if kind == "image":
246
+ _PILImage = _pil_image()
247
+ return ("image", _PILImage.open(io.BytesIO(data)).convert("RGB"))
248
+ if kind == "svg":
249
+ return ("image", _svg_to_image(bytes(data)))
250
+ if kind in ("video", "audio"):
251
+ import tempfile as _tf
252
+ ext = ".mp4" if kind == "video" else ".wav"
253
+ tf = _tf.NamedTemporaryFile(suffix=ext, delete=False)
254
+ tf.write(data); tf.close()
255
+ return (kind, tf.name)
256
+ if kind == "pdf":
257
+ # pypdfium2 reads bytes directly — no temp file needed.
258
+ return ("pdf", bytes(data))
259
+
260
+ if isinstance(x, str):
261
+ local = _download_if_url(x)
262
+ if _os.path.isfile(local):
263
+ kind = _sniff_media_type(local)
264
+ if kind == "image":
265
+ _PILImage = _pil_image()
266
+ return ("image", _PILImage.open(local).convert("RGB"))
267
+ if kind == "svg":
268
+ return ("image", _svg_to_image(local))
269
+ if kind in ("video", "audio", "pdf"):
270
+ return (kind, local)
271
+ return ("text", x)
272
+
273
+ return ("text", str(x))
274
+
275
+
276
+ def _is_media_string(x) -> bool:
277
+ if not isinstance(x, str):
278
+ return False
279
+ return _resolve_input(x)[0] in ("image", "video", "audio", "pdf")
280
+
281
+
282
+ def _prompt_from_kwargs(st_model, kwargs):
283
+ prompt = kwargs.get("prompt")
284
+ if prompt is None:
285
+ prompt_name = kwargs.get("prompt_name") or getattr(st_model, "default_prompt_name", None)
286
+ prompt = (getattr(st_model, "prompts", {}) or {}).get(prompt_name, "") if prompt_name else ""
287
+ return prompt or ""
288
+
289
+
290
+ def _raw_media_parts(st_model, value, kwargs):
291
+ prompt = _prompt_from_kwargs(st_model, kwargs)
292
+ return (prompt, value) if prompt else (value,)
293
+
294
+
295
+ def _prompted_parts(st_model, value, kwargs):
296
+ parts = value if isinstance(value, tuple) else (value,)
297
+ prompt = _prompt_from_kwargs(st_model, kwargs)
298
+ return (prompt, *parts) if prompt else parts
299
+
300
+
301
+ def _align_eval_processor(processor):
302
+ video_processor = getattr(processor, "video_processor", None)
303
+ if video_processor is None:
304
+ return
305
+ if hasattr(video_processor, "do_sample_frames"):
306
+ video_processor.do_sample_frames = False
307
+ for attr in ("max_frames", "num_frames"):
308
+ if hasattr(video_processor, attr):
309
+ setattr(video_processor, attr, EVAL_VIDEO_NUM_FRAMES)
310
+ if hasattr(video_processor, "size") and isinstance(video_processor.size, dict):
311
+ video_processor.size = {
312
+ **video_processor.size,
313
+ "longest_edge": EVAL_VIDEO_MAX_PIXELS,
314
+ "shortest_edge": EVAL_IMAGE_MIN_PIXELS,
315
+ }
316
+ if hasattr(video_processor, "max_pixels"):
317
+ video_processor.max_pixels = EVAL_VIDEO_MAX_PIXELS
318
+ if hasattr(video_processor, "min_pixels"):
319
+ video_processor.min_pixels = EVAL_IMAGE_MIN_PIXELS
320
+
321
+
322
+ def _build_eval_image_prompt(processor, prefix: str = ""):
323
+ image_token = getattr(processor, "image_token", IMAGE_PROMPT)
324
+ text = f"{prefix or ''}<|vision_start|>{image_token}<|vision_end|>"
325
+ try:
326
+ return processor.apply_chat_template(
327
+ [{"role": "user", "content": text}],
328
+ tokenize=False,
329
+ add_generation_prompt=False,
330
+ )
331
+ except (ValueError, AttributeError):
332
+ return f"{prefix or ''}{IMAGE_PROMPT}"
333
+
334
+
335
+ def _audio_output_length(feature_attention_mask):
336
+ real_frames = feature_attention_mask.sum(-1)
337
+ aftercnn = (real_frames - 1) // 2 + 1
338
+ return int(((aftercnn - 2) // 2 + 1).item())
339
+
340
+
341
+ def _load_audio_array(audio_input):
342
+ import numpy as np
343
+
344
+ if isinstance(audio_input, _AudioWrapper):
345
+ return audio_input.array.astype(np.float32), audio_input.sampling_rate
346
+ if isinstance(audio_input, str):
347
+ try:
348
+ import librosa
349
+ except ImportError as e:
350
+ raise ImportError(
351
+ "Loading audio from a file path needs `pip install librosa`"
352
+ " (or pass a 1-D numpy float32 waveform at 16 kHz)."
353
+ ) from e
354
+ audio, sr = librosa.load(audio_input, sr=16000)
355
+ return audio.astype(np.float32), sr
356
+ if isinstance(audio_input, np.ndarray):
357
+ return audio_input.astype(np.float32), 16000
358
+ raise TypeError(f"Unsupported audio input type: {type(audio_input)}")
359
+
360
+
361
+ def _build_audio_model_inputs(owner, audio_input, device, prefix: str = ""):
362
+ import numpy as np
363
+ from transformers import WhisperFeatureExtractor
364
+
365
+ audio, sr = _load_audio_array(audio_input)
366
+ if not np.isfinite(audio).all():
367
+ audio = np.nan_to_num(audio, nan=0.0, posinf=0.0, neginf=0.0)
368
+ peak = float(np.max(np.abs(audio))) if audio.size else 0.0
369
+ if peak > 1.0:
370
+ audio = audio / peak
371
+
372
+ feat_ext = WhisperFeatureExtractor(feature_size=128)
373
+ audio_inputs = feat_ext(
374
+ audio,
375
+ sampling_rate=sr,
376
+ return_tensors="pt",
377
+ padding="max_length",
378
+ return_attention_mask=True,
379
+ )
380
+ input_features = audio_inputs["input_features"]
381
+ feature_attention_mask = audio_inputs["attention_mask"]
382
+ n_tokens = _audio_output_length(feature_attention_mask)
383
+
384
+ start = owner.tokenizer.convert_ids_to_tokens(owner.config.audio_start_token_id)
385
+ token = owner.tokenizer.convert_ids_to_tokens(owner.config.audio_token_id)
386
+ end = owner.tokenizer.convert_ids_to_tokens(owner.config.audio_end_token_id)
387
+ audio_seq = start + token * n_tokens + end
388
+ text = f"{prefix or ''}{audio_seq}"
389
+ try:
390
+ prompt = owner.processor.apply_chat_template(
391
+ [{"role": "user", "content": text}],
392
+ tokenize=False,
393
+ add_generation_prompt=False,
394
+ )
395
+ except (ValueError, AttributeError):
396
+ prompt = text
397
+
398
+ out = owner.processor(text=[prompt], return_tensors="pt", padding=False, truncation=False)
399
+ model_dtype = next(owner.model.parameters()).dtype
400
+ inputs = {k: v.to(device) for k, v in out.items() if torch.is_tensor(v)}
401
+ inputs["input_features"] = input_features.to(device=device, dtype=model_dtype)
402
+ inputs["feature_attention_mask"] = feature_attention_mask.to(device)
403
+ pos_builder = globals().get("_get_1d_position_ids")
404
+ if pos_builder is not None:
405
+ inputs["position_ids"] = pos_builder(inputs["attention_mask"])
406
+ return inputs
407
+
408
+
409
+ def _extract_audio_from_video(video_path):
410
+ """Return mono float32 audio @ 16 kHz decoded from the video's audio track, or
411
+ None if no audio stream is present. PyAV is already a dep for video decoding."""
412
+ try:
413
+ import av
414
+ import numpy as np
415
+ from av.audio.resampler import AudioResampler
416
+ except ImportError:
417
+ return None
418
+ container = av.open(video_path)
419
+ try:
420
+ audio_stream = next((s for s in container.streams if s.type == "audio"), None)
421
+ if audio_stream is None:
422
+ return None
423
+ resampler = AudioResampler(format="flt", layout="mono", rate=16000)
424
+ samples = []
425
+ for frame in container.decode(audio=0):
426
+ for rf in resampler.resample(frame):
427
+ samples.append(rf.to_ndarray().flatten())
428
+ for rf in resampler.resample(None):
429
+ samples.append(rf.to_ndarray().flatten())
430
+ if not samples:
431
+ return None
432
+ return np.concatenate(samples).astype(np.float32)
433
+ finally:
434
+ container.close()
435
+
436
+
437
+ def _eval_video_frames(video_path):
438
+ if not isinstance(video_path, str):
439
+ return video_path
440
+ try:
441
+ import av
442
+ import numpy as np
443
+ except ImportError:
444
+ return video_path
445
+ container = av.open(video_path)
446
+ try:
447
+ frames = [frame.to_image().convert("RGB") for frame in container.decode(video=0)]
448
+ finally:
449
+ container.close()
450
+ if not frames:
451
+ return video_path
452
+ if len(frames) <= EVAL_VIDEO_NUM_FRAMES:
453
+ return frames
454
+ indices = np.linspace(0, len(frames) - 1, EVAL_VIDEO_NUM_FRAMES, dtype=int).tolist()
455
+ return [frames[i] for i in indices]
456
+
457
+
458
+ def _pdf_to_images(pdf, dpi: int = PDF_DPI):
459
+ """Rasterise every page of a PDF to a list of PIL.Image (RGB).
460
+
461
+ `pdf` may be a path, raw bytes, BytesIO, or an existing list of PIL.Images
462
+ (returned as-is). Lazy-imports `pypdfium2` so users who never touch PDFs
463
+ are not forced to install it.
464
+ """
465
+ _PILImage = _pil_image() # PIL is a hard dep of the image path
466
+ if isinstance(pdf, list) and pdf and all(isinstance(p, _PILImage.Image) for p in pdf):
467
+ return pdf
468
+ try:
469
+ import pypdfium2 as pdfium
470
+ except ImportError as e:
471
+ raise ImportError(
472
+ "Decoding PDF pages needs `pip install pypdfium2`."
473
+ ) from e
474
+ import io as _io
475
+ if isinstance(pdf, (bytes, bytearray)):
476
+ doc = pdfium.PdfDocument(bytes(pdf))
477
+ elif isinstance(pdf, _io.IOBase):
478
+ doc = pdfium.PdfDocument(pdf.read())
479
+ else:
480
+ doc = pdfium.PdfDocument(pdf)
481
+ scale = dpi / 72.0
482
+ pages = []
483
+ try:
484
+ for page in doc:
485
+ pil = page.render(scale=scale).to_pil().convert("RGB")
486
+ pages.append(pil)
487
+ finally:
488
+ doc.close()
489
+ return pages
490
+
491
+
492
+ def _patch_st_encode_multipart():
493
+ """Intercept ST.encode for multipart tuple inputs so PIL.Image and
494
+ np.ndarray media parts bypass ST's length-sort."""
495
+ import importlib
496
+ import torch
497
+ try:
498
+ st_mod = importlib.import_module("sentence_transformers.SentenceTransformer")
499
+ except ImportError:
500
+ return
501
+ _ST = st_mod.SentenceTransformer
502
+ if getattr(_ST.encode, "_omni_multipart_patched", False):
503
+ return
504
+ _orig = _ST.encode
505
+
506
+ def _encode(self, sentences, *args, **kwargs):
507
+ def _is_nonstring_input(x):
508
+ # anything other than a pure string becomes a 1-part multipart item
509
+ return not isinstance(x, str)
510
+ single_bare = _is_nonstring_input(sentences) and not isinstance(sentences, list)
511
+ list_with_nonstr = (isinstance(sentences, list) and sentences
512
+ and any(_is_nonstring_input(s) for s in sentences))
513
+ single_media_string = isinstance(sentences, str) and _is_media_string(sentences)
514
+ list_with_media_string = (isinstance(sentences, list) and sentences
515
+ and any(isinstance(s, str) and _is_media_string(s) for s in sentences))
516
+ fwd_keys = getattr(self[0], "forward_kwargs", set())
517
+ forward_kwargs = {k: kwargs[k] for k in fwd_keys if k in kwargs}
518
+ if single_media_string or list_with_media_string:
519
+ if single_media_string:
520
+ batch = [_raw_media_parts(self, sentences, kwargs)]
521
+ else:
522
+ batch = [_raw_media_parts(self, s, kwargs) for s in sentences]
523
+ features = {"_multipart_batch": batch, "_is_multipart_batch": True}
524
+ with torch.no_grad():
525
+ out = self[0](features, **forward_kwargs)
526
+ emb = out["sentence_embedding"]
527
+ if kwargs.get("convert_to_numpy", True):
528
+ emb = emb.detach().cpu().float().numpy()
529
+ if single_media_string:
530
+ emb = emb[0] if hasattr(emb, "__getitem__") else emb
531
+ return emb
532
+ if single_bare or list_with_nonstr:
533
+ if single_bare:
534
+ batch = [_prompted_parts(self, sentences, kwargs)]
535
+ else:
536
+ batch = [_prompted_parts(self, s, kwargs) for s in sentences]
537
+ features = {"_multipart_batch": batch, "_is_multipart_batch": True}
538
+ with torch.no_grad():
539
+ out = self[0](features, **forward_kwargs)
540
+ emb = out["sentence_embedding"]
541
+ if kwargs.get("convert_to_numpy", True):
542
+ emb = emb.detach().cpu().float().numpy()
543
+ if single_bare:
544
+ emb = emb[0] if hasattr(emb, "__getitem__") else emb
545
+ return emb
546
+ result = _orig(self, sentences, *args, **kwargs)
547
+ # ST 5.x applies truncate_dim without L2 renormalization; the README
548
+ # promises unit-norm truncated embeddings, so restore that here.
549
+ if kwargs.get("truncate_dim") is not None and not kwargs.get("normalize_embeddings", False):
550
+ import numpy as _np
551
+ if torch.is_tensor(result):
552
+ result = torch.nn.functional.normalize(result, p=2, dim=-1)
553
+ elif isinstance(result, _np.ndarray):
554
+ n = _np.linalg.norm(result, axis=-1, keepdims=True) + 1e-12
555
+ result = result / n
556
+ return result
557
+
558
+ _encode._omni_multipart_patched = True
559
+ _ST.encode = _encode
560
+
561
+ def encode_query(self, sentences, *args, **kwargs):
562
+ kwargs.setdefault("prompt_name", "query")
563
+ return self.encode(sentences, *args, **kwargs)
564
+
565
+ def encode_document(self, sentences, *args, **kwargs):
566
+ kwargs.setdefault("prompt_name", "document")
567
+ return self.encode(sentences, *args, **kwargs)
568
+
569
+ _ST.encode_query = encode_query
570
+ _ST.encode_document = encode_document
571
+
572
+
573
+ _patch_st_encode_multipart()
574
+
575
+
576
+ class Transformer(nn.Module):
577
+ save_in_root: bool = True
578
+ # Tells sentence-transformers to thread these kwargs from encode() through
579
+ # to our forward() — otherwise ST filters unknown kwargs out.
580
+ forward_kwargs = {"task", "truncate_dim"}
581
+
582
+ def __init__(
583
+ self,
584
+ model_name_or_path: str = "jinaai/jina-embeddings-v5-omni-nano",
585
+ max_seq_length: Optional[int] = None,
586
+ config_args: Optional[Dict[str, Any]] = None,
587
+ model_args: Optional[Dict[str, Any]] = None,
588
+ tokenizer_args: Optional[Dict[str, Any]] = None,
589
+ cache_dir: Optional[str] = None,
590
+ backend: str = "torch",
591
+ task: Optional[str] = None,
592
+ default_task: Optional[str] = None,
593
+ **kwargs,
594
+ ) -> None:
595
+ super().__init__()
596
+ if backend != "torch":
597
+ raise ValueError(
598
+ f"Backend '{backend}' is not supported, please use 'torch' instead"
599
+ )
600
+
601
+ config_kwargs = dict(config_args or {})
602
+ model_kwargs = dict(model_args or {})
603
+ tokenizer_kwargs = dict(tokenizer_args or {})
604
+
605
+ # Default-task resolution precedence (highest to lowest):
606
+ # 1. `task` / `default_task` kwarg to this __init__
607
+ # 2. `model_args={'default_task': ...}` (legacy path)
608
+ # 3. JINA_V5_TASK env var
609
+ # 4. unset -> encode() must pass task=
610
+ self.default_task = (
611
+ task
612
+ or default_task
613
+ or model_kwargs.pop("default_task", None)
614
+ or os.environ.get("JINA_V5_TASK")
615
+ )
616
+ if self.default_task and self.default_task not in TASK_NAMES:
617
+ raise ValueError(
618
+ f"Invalid task: {self.default_task}. Must be one of {TASK_NAMES}."
619
+ )
620
+
621
+ # setdefault so caller-provided trust_remote_code isn't duplicated
622
+ config_kwargs.setdefault("trust_remote_code", True)
623
+ model_kwargs.setdefault("trust_remote_code", True)
624
+ tokenizer_kwargs.setdefault("trust_remote_code", True)
625
+ # Dedupe cache_dir: we pass it explicitly below, so strip any copy
626
+ # that sentence-transformers may have also threaded through *_args.
627
+ for _kw in (config_kwargs, model_kwargs, tokenizer_kwargs):
628
+ _kw.pop("cache_dir", None)
629
+
630
+ self.config = AutoConfig.from_pretrained(
631
+ model_name_or_path, cache_dir=cache_dir, **config_kwargs
632
+ )
633
+ self.model = AutoModel.from_pretrained(
634
+ model_name_or_path, cache_dir=cache_dir, **model_kwargs,
635
+ )
636
+ self.tokenizer = self.model.tokenizer
637
+ # AutoProcessor pulls in PIL transitively; lazy-import so users on
638
+ # text-only setups (no pillow installed) can still load the model.
639
+ try:
640
+ from transformers import AutoProcessor as _AutoProcessor
641
+ processor_kwargs = dict(tokenizer_kwargs)
642
+ processor_kwargs.setdefault("min_pixels", EVAL_IMAGE_MIN_PIXELS)
643
+ processor_kwargs.setdefault("max_pixels", EVAL_IMAGE_MAX_PIXELS)
644
+ self.processor = _AutoProcessor.from_pretrained(
645
+ model_name_or_path, cache_dir=cache_dir, **processor_kwargs,
646
+ )
647
+ _align_eval_processor(self.processor)
648
+ except Exception:
649
+ self.processor = None
650
+
651
+ tc = getattr(self.config, "text_config", self.config)
652
+ max_pos = getattr(tc, "max_position_embeddings", MAX_SEQ_LENGTH)
653
+ self.max_seq_length = max_seq_length or min(max_pos, MAX_SEQ_LENGTH)
654
+
655
+ def tokenize(
656
+ self,
657
+ texts: Union[List[str], List[Dict], list],
658
+ padding: Union[str, bool] = True,
659
+ **kwargs,
660
+ ) -> Dict[str, torch.Tensor]:
661
+ if texts and any(isinstance(t, tuple) for t in texts):
662
+ # Wrap non-tuple entries as 1-tuples so every batch slot goes
663
+ # through _encode_parts. Lets users mix: [(t,img), "plain text"].
664
+ wrapped = [t if isinstance(t, tuple) else (t,) for t in texts]
665
+ return {"_multipart_batch": wrapped, "_is_multipart_batch": True}
666
+ resolved = [_resolve_input(t) for t in texts]
667
+ # Heterogeneous batch (e.g. ["speech.wav", "plain text"]) — route through
668
+ # the multipart path where each element is dispatched on its own kind.
669
+ if len({k for k, _ in resolved}) > 1:
670
+ wrapped = [t if isinstance(t, tuple) else (t,) for t in texts]
671
+ return {"_multipart_batch": wrapped, "_is_multipart_batch": True}
672
+ first_kind = resolved[0][0]
673
+ values = [v for _, v in resolved]
674
+
675
+ if first_kind == "image":
676
+ return {"_images": values, "_is_image_batch": True}
677
+ if first_kind == "video":
678
+ return {"_video_paths": values, "_is_video_batch": True}
679
+ if first_kind == "audio":
680
+ return {"_audio_paths": values, "_is_audio_batch": True}
681
+ if first_kind == "pdf":
682
+ return {"_pdfs": values, "_is_pdf_batch": True}
683
+
684
+ if isinstance(texts[0], dict):
685
+ texts = [next(iter(t.values())) for t in texts]
686
+ elif isinstance(texts[0], (list, tuple)):
687
+ texts = [t[0] for t in texts]
688
+
689
+ return self.tokenizer(
690
+ [str(s) for s in texts],
691
+ max_length=self.max_seq_length,
692
+ truncation=True,
693
+ padding=padding,
694
+ return_tensors="pt",
695
+ )
696
+
697
+ def _resolve_task(self, task: Optional[str]) -> str:
698
+ if task is None:
699
+ if self.default_task is None:
700
+ raise ValueError(
701
+ "Task must be specified. Set it during loading "
702
+ "(model_kwargs={'default_task': 'retrieval'}) or pass "
703
+ "task='retrieval' to encode()."
704
+ )
705
+ task = self.default_task
706
+ if task not in TASK_NAMES:
707
+ raise ValueError(f"Invalid task: {task}. Must be one of {TASK_NAMES}.")
708
+ return task
709
+
710
+ def _last_token_pool(self, hidden, attention_mask):
711
+ seq_lens = attention_mask.sum(dim=1) - 1
712
+ pooled = hidden[torch.arange(hidden.shape[0], device=hidden.device), seq_lens]
713
+ return F.normalize(pooled, p=2, dim=-1).float()
714
+
715
+ def _encode_single_image(self, image, device, prefix: str = "") -> torch.Tensor:
716
+ prompt = _build_eval_image_prompt(self.processor, prefix=prefix)
717
+ inputs = self.processor(images=image, text=prompt, return_tensors="pt", truncation=False)
718
+ inputs = {k: v.to(device) for k, v in inputs.items() if torch.is_tensor(v)}
719
+ with torch.no_grad():
720
+ hidden = self.model(**inputs).last_hidden_state
721
+ return self._last_token_pool(hidden, inputs["attention_mask"]).squeeze(0)
722
+
723
+ def _encode_single_video(self, video_path, device) -> torch.Tensor:
724
+ video = _eval_video_frames(video_path)
725
+ inputs = self.processor(videos=video, text=VIDEO_PROMPT, return_tensors="pt", truncation=False)
726
+ inputs = {k: v.to(device) for k, v in inputs.items() if torch.is_tensor(v)}
727
+ with torch.no_grad():
728
+ hidden = self.model(**inputs).last_hidden_state
729
+ return self._last_token_pool(hidden, inputs["attention_mask"]).squeeze(0)
730
+
731
+ def _encode_single_audio(self, audio_input, device, prefix: str = "") -> torch.Tensor:
732
+ inputs = _build_audio_model_inputs(self, audio_input, device, prefix=prefix)
733
+ with torch.no_grad():
734
+ hidden = self.model(**inputs).last_hidden_state
735
+ return self._last_token_pool(hidden, inputs["attention_mask"]).squeeze(0)
736
+
737
+ def _encode_single_pdf(self, pdf, device) -> torch.Tensor:
738
+ """Encode a PDF as a fused sequence of page images (single embedding).
739
+
740
+ Pages are rasterised with pypdfium2 then fed through the same
741
+ multipart fusion path used for tuples — so a 3-page PDF produces
742
+ a single embedding spanning all three rendered pages.
743
+ """
744
+ pages = _pdf_to_images(pdf)
745
+ if not pages:
746
+ raise ValueError("PDF has 0 pages — nothing to encode.")
747
+ return self._encode_parts(tuple(pages), device)
748
+
749
+ def _encode_composite_parts(self, expanded, device) -> torch.Tensor:
750
+ import numpy as np
751
+ from transformers import WhisperFeatureExtractor
752
+
753
+ content = []
754
+ images, videos = [], []
755
+ audio_features, feature_masks = [], []
756
+ feat_ext = None
757
+ for kind, p in expanded:
758
+ if kind == "text":
759
+ content.append({"type": "text", "text": str(p)})
760
+ elif kind == "image":
761
+ content.append({"type": "image"})
762
+ images.append(p)
763
+ elif kind == "video":
764
+ content.append({"type": "video"})
765
+ videos.append(_eval_video_frames(p) if isinstance(p, str) else p)
766
+ elif kind == "audio":
767
+ if feat_ext is None:
768
+ feat_ext = WhisperFeatureExtractor(feature_size=128)
769
+ audio_arr, sr = _load_audio_array(p)
770
+ if not np.isfinite(audio_arr).all():
771
+ audio_arr = np.nan_to_num(audio_arr, nan=0.0, posinf=0.0, neginf=0.0)
772
+ peak = float(np.max(np.abs(audio_arr))) if audio_arr.size else 0.0
773
+ if peak > 1.0:
774
+ audio_arr = audio_arr / peak
775
+ audio_inputs = feat_ext(
776
+ audio_arr,
777
+ sampling_rate=sr,
778
+ return_tensors="pt",
779
+ padding="max_length",
780
+ return_attention_mask=True,
781
+ )
782
+ feat_mask = audio_inputs["attention_mask"]
783
+ n_tokens = _audio_output_length(feat_mask)
784
+ start = self.tokenizer.convert_ids_to_tokens(self.config.audio_start_token_id)
785
+ token = self.tokenizer.convert_ids_to_tokens(self.config.audio_token_id)
786
+ end = self.tokenizer.convert_ids_to_tokens(self.config.audio_end_token_id)
787
+ content.append({"type": "text", "text": start + token * n_tokens + end})
788
+ audio_features.append(audio_inputs["input_features"])
789
+ feature_masks.append(feat_mask)
790
+
791
+ has_chat_template = getattr(self.processor, "chat_template", None) is not None
792
+ if has_chat_template:
793
+ prompt = self.processor.apply_chat_template(
794
+ [{"role": "user", "content": content}],
795
+ tokenize=False,
796
+ add_generation_prompt=False,
797
+ )
798
+ if images or videos:
799
+ image_token = getattr(self.processor, "image_token", "<|image_pad|>")
800
+ video_token = getattr(self.processor, "video_token", "<|video_pad|>")
801
+ flat = []
802
+ for c in content:
803
+ if c.get("type") == "text":
804
+ flat.append(c["text"])
805
+ elif c.get("type") == "image":
806
+ flat.append(f"<|vision_start|>{image_token}<|vision_end|>")
807
+ elif c.get("type") == "video":
808
+ flat.append(f"<|vision_start|>{video_token}<|vision_end|>")
809
+ prompt_flat = self.processor.apply_chat_template(
810
+ [{"role": "user", "content": "".join(flat)}],
811
+ tokenize=False,
812
+ add_generation_prompt=False,
813
+ )
814
+ if "<|vision_start|>" in prompt_flat:
815
+ prompt = prompt_flat
816
+ else:
817
+ pieces = []
818
+ for c in content:
819
+ if c.get("type") == "text":
820
+ pieces.append(c["text"])
821
+ elif c.get("type") == "image":
822
+ pieces.append(IMAGE_PROMPT)
823
+ elif c.get("type") == "video":
824
+ pieces.append(VIDEO_PROMPT)
825
+ prompt = "".join(pieces)
826
+
827
+ proc_kwargs = {"text": [prompt], "return_tensors": "pt", "padding": False, "truncation": False}
828
+ if images:
829
+ proc_kwargs["images"] = images
830
+ if videos:
831
+ proc_kwargs["videos"] = videos
832
+ out = self.processor(**proc_kwargs)
833
+ model_dtype = next(self.model.parameters()).dtype
834
+ inputs = {k: v.to(device) if torch.is_tensor(v) else v for k, v in out.items()}
835
+ if audio_features:
836
+ inputs["input_features"] = torch.cat(audio_features, dim=0).to(device=device, dtype=model_dtype)
837
+ inputs["feature_attention_mask"] = torch.cat(feature_masks, dim=0).to(device)
838
+
839
+ if "Qwen" in type(self.processor).__name__:
840
+ ids = inputs["input_ids"].squeeze(0)
841
+ mm_ids = torch.zeros_like(ids, dtype=torch.int32)
842
+ image_token_id = self.processor.tokenizer.convert_tokens_to_ids(getattr(self.processor, "image_token", "<image>"))
843
+ video_token_id = self.processor.tokenizer.convert_tokens_to_ids(getattr(self.processor, "video_token", "<video>"))
844
+ audio_token_id = self.processor.tokenizer.convert_tokens_to_ids(self.tokenizer.convert_ids_to_tokens(self.config.audio_token_id))
845
+ mm_ids += (ids == image_token_id).to(torch.int32)
846
+ mm_ids += 2 * (ids == video_token_id).to(torch.int32)
847
+ mm_ids += 3 * (ids == audio_token_id).to(torch.int32)
848
+ inputs["mm_token_type_ids"] = mm_ids.unsqueeze(0)
849
+ mask = inputs["attention_mask"]
850
+ pos = mask.long().cumsum(-1) - 1
851
+ pos = pos.masked_fill(mask == 0, 0)
852
+ inputs["position_ids"] = pos.unsqueeze(0).expand(3, -1, -1).contiguous()
853
+ else:
854
+ pos_builder = globals().get("_get_1d_position_ids")
855
+ if pos_builder is not None:
856
+ inputs["position_ids"] = pos_builder(inputs["attention_mask"])
857
+
858
+ with torch.no_grad():
859
+ hidden = self.model(**inputs).last_hidden_state
860
+ return self._last_token_pool(hidden, inputs["attention_mask"]).squeeze(0)
861
+
862
+ def _encode_parts(self, parts, device) -> torch.Tensor:
863
+ """Fuse a tuple of parts into one embedding in a single forward pass.
864
+
865
+ Each part may be a URL, a local path (sniffed by magic bytes if no
866
+ extension), a PIL.Image, a 1-D numpy audio array, a PDF (rasterised
867
+ to one image per page), or plain text. A video with an audio track
868
+ is auto-expanded to [extracted_audio, video] so the audio tokens
869
+ precede the video tokens.
870
+ """
871
+
874
+ # Normalize every part first (URL -> path, content-sniff if needed).
875
+ resolved = [_resolve_input(p) for p in parts]
876
+
877
+ # Expand videos-with-audio: prepend extracted audio.
878
+ # Expand PDFs: rasterise into one image-part per page.
879
+ expanded = []
880
+ for kind, value in resolved:
881
+ if kind == "video":
882
+ if isinstance(value, str):
883
+ aud = _extract_audio_from_video(value)
884
+ if aud is not None and aud.size > 0:
885
+ expanded.append(("audio", aud))
886
+ expanded.append(("video", value))
887
+ elif kind == "pdf":
888
+ for page in _pdf_to_images(value):
889
+ expanded.append(("image", page))
890
+ else:
891
+ expanded.append((kind, value))
892
+
893
+
899
+ if len(expanded) == 1 and expanded[0][0] == "image":
900
+ return self._encode_single_image(expanded[0][1], device)
901
+ if len(expanded) == 1 and expanded[0][0] == "audio":
902
+ return self._encode_single_audio(expanded[0][1], device)
903
+ if len(expanded) == 2 and expanded[0][0] == "text" and expanded[1][0] == "image":
904
+ return self._encode_single_image(expanded[1][1], device, prefix=str(expanded[0][1]))
905
+ if len(expanded) == 2 and expanded[0][0] == "text" and expanded[1][0] == "audio":
906
+ return self._encode_single_audio(expanded[1][1], device, prefix=str(expanded[0][1]))
907
+
908
+ return self._encode_composite_parts(expanded, device)
909
+
910
+ def forward(
911
+ self,
912
+ features: Dict[str, torch.Tensor],
913
+ task: Optional[str] = None,
914
+ truncate_dim: Optional[int] = None,
915
+ **kwargs,
916
+ ) -> Dict[str, torch.Tensor]:
917
+ self.model.eval()
918
+ device = next(self.model.parameters()).device
919
+ task = self._resolve_task(task)
920
+ self.model.set_adapter([task])
921
+
922
+ if features.get("_is_multipart_batch"):
923
+ embs = [self._encode_parts(parts, device) for parts in features["_multipart_batch"]]
924
+ features["sentence_embedding"] = torch.stack(embs)
925
+ return self._maybe_truncate(features, truncate_dim)
926
+
927
+ if features.get("_is_image_batch"):
928
+ embs = [self._encode_single_image(img, device) for img in features["_images"]]
929
+ features["sentence_embedding"] = torch.stack(embs)
930
+ return self._maybe_truncate(features, truncate_dim)
931
+
932
+ if features.get("_is_video_batch"):
933
+ embs = [self._encode_single_video(p, device) for p in features["_video_paths"]]
934
+ features["sentence_embedding"] = torch.stack(embs)
935
+ return self._maybe_truncate(features, truncate_dim)
936
+
937
+ if features.get("_is_audio_batch"):
938
+ embs = [self._encode_single_audio(p, device) for p in features["_audio_paths"]]
939
+ features["sentence_embedding"] = torch.stack(embs)
940
+ return self._maybe_truncate(features, truncate_dim)
941
+
942
+ if features.get("_is_pdf_batch"):
943
+ embs = [self._encode_single_pdf(p, device) for p in features["_pdfs"]]
944
+ features["sentence_embedding"] = torch.stack(embs)
945
+ return self._maybe_truncate(features, truncate_dim)
946
+
947
+ batch = {k: v.to(device) for k, v in features.items() if torch.is_tensor(v)}
948
+ with torch.no_grad():
949
+ hidden = self.model(**batch).last_hidden_state
950
+
951
+ features["sentence_embedding"] = self._last_token_pool(hidden, batch["attention_mask"])
952
+ return self._maybe_truncate(features, truncate_dim)
953
+
954
+ @staticmethod
955
+ def _maybe_truncate(features, truncate_dim):
956
+ # Slicing an L2-normalized vector and renormalizing is equivalent to
957
+ # truncate-then-normalize on the raw pooled vector — so this produces a
958
+ # unit-norm matryoshka embedding.
959
+ if truncate_dim is not None:
960
+ emb = features["sentence_embedding"][..., :truncate_dim]
961
+ features["sentence_embedding"] = F.normalize(emb, p=2, dim=-1)
962
+ return features
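A quick self-contained check of the equivalence claimed in the comment above (illustrative only, not part of the committed file; the shapes are arbitrary):

import torch
import torch.nn.functional as F

raw = torch.randn(8, 1024)
a = F.normalize(F.normalize(raw, p=2, dim=-1)[..., :256], p=2, dim=-1)  # what _maybe_truncate does
b = F.normalize(raw[..., :256], p=2, dim=-1)                            # truncate-then-normalize
assert torch.allclose(a, b, atol=1e-6)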
963
+
964
+ def get_word_embedding_dimension(self) -> int:
965
+ tc = getattr(self.config, "text_config", self.config)
966
+ return getattr(tc, "hidden_size", 768)
967
+
968
+ def get_sentence_embedding_dimension(self) -> int:
969
+ return self.get_word_embedding_dimension()
970
+
971
+ def get_max_seq_length(self) -> int:
972
+ return self.max_seq_length
973
+
974
+ def save(self, output_path: str, safe_serialization: bool = True, **kwargs) -> None:
975
+ self.model.save_pretrained(output_path, safe_serialization=safe_serialization)
976
+ self.tokenizer.save_pretrained(output_path)
977
+ config = {"max_seq_length": self.max_seq_length}
978
+ with open(os.path.join(output_path, "sentence_bert_config.json"), "w") as f:
979
+ json.dump(config, f, indent=2)
980
+
981
+ @classmethod
982
+ def load(cls, input_path: str) -> "Transformer":
983
+ # Signature must have exactly 1 param so ST routes through the direct
984
+ # constructor path (which maps model_kwargs -> model_args correctly).
985
+ config_path = os.path.join(input_path, "sentence_bert_config.json")
986
+ extra = {}
987
+ if os.path.exists(config_path):
988
+ with open(config_path) as f:
989
+ extra = json.load(f)
990
+ return cls(model_name_or_path=input_path, **extra)
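Taken together, `tokenize()` and `forward()` above let one SentenceTransformer call route plain text, file paths, and multipart tuples. A hedged end-to-end sketch; the file names are placeholders and assumed to exist locally:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-nano",
    trust_remote_code=True,
    model_kwargs={"default_task": "retrieval"},
)
text_emb  = model.encode(["a photo of a cat"])
image_emb = model.encode(["cat.jpg"])                     # image-batch path
audio_emb = model.encode(["meow.wav"])                    # audio-batch path
fused_emb = model.encode([("a cat meowing", "cat.jpg")])  # multipart tuple, one fused vector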
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1b25867d693db0304d5030b8d6ba833b3bc371dc002b2559e49524e0ea405b4b
3
+ size 1972058984
modeling_jina_embeddings_v5_omni.py ADDED
@@ -0,0 +1,616 @@
1
+ """
2
+ Unified jina-embeddings-v5-omni-nano: vision + audio + text with task-specific routing.
3
+
4
+ Shared: Qwen3VLVisionModel + Qwen2.5-Omni audio encoder + LlamaModel (EuroBERT, bidirectional)
5
+ Per-task: vision merger, audio projector, special token embeddings, LoRA adapter
6
+
7
+ Modality loading:
8
+ model = AutoModel.from_pretrained(path, trust_remote_code=True) # all components (default)
9
+ model = AutoModel.from_pretrained(path, trust_remote_code=True, modality="vision") # no audio tower/projectors
10
+ model = AutoModel.from_pretrained(path, trust_remote_code=True, modality="audio") # no vision tower/mergers
11
+
12
+ Usage:
13
+ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True)
14
+ embeddings = model.encode(["hello world"], task="retrieval")
15
+ """
16
+
17
+ from typing import List, Optional
18
+ import os
19
+
20
+ import torch
21
+ import torch.nn as nn
22
+ import torch.nn.functional as F
23
+
24
+ from huggingface_hub import snapshot_download
25
+ from transformers import AutoTokenizer, LlamaConfig, PreTrainedModel, PretrainedConfig
26
+ from transformers.modeling_outputs import BaseModelOutputWithPast
27
+ from transformers.models.llama.modeling_llama import LlamaModel
28
+ from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLVisionConfig
29
+ from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLVisionModel
30
+ from transformers.models.qwen2_5_omni.configuration_qwen2_5_omni import Qwen2_5OmniAudioEncoderConfig
31
+ from transformers.models.qwen2_5_omni.modeling_qwen2_5_omni import Qwen2_5OmniAudioEncoder
32
+ from peft import PeftMixedModel, PeftConfig
33
+
34
+ TASK_NAMES = ["retrieval", "text-matching", "clustering", "classification"]
35
+ _VALID_MODALITIES = ("omni", "vision", "audio", "text")
36
+
37
+
38
+ def _key(task):
39
+ return task.replace("-", "_")
40
+
41
+
42
+ class PretrainedMerger(nn.Module):
43
+ def __init__(self, hidden_size, out_hidden_size, spatial_merge_size=2):
44
+ super().__init__()
45
+ self.hidden_size = hidden_size * (spatial_merge_size ** 2)
46
+ self.norm = nn.LayerNorm(hidden_size, eps=1e-6)
47
+ self.linear_fc1 = nn.Linear(self.hidden_size, self.hidden_size)
48
+ self.act = nn.GELU()
49
+ self.linear_fc2 = nn.Linear(self.hidden_size, out_hidden_size)
50
+
51
+ def forward(self, x):
52
+ x = self.norm(x)
53
+ x = x.view(-1, self.hidden_size)
54
+ x = self.linear_fc2(self.act(self.linear_fc1(x)))
55
+ return x
56
+
57
+
58
+ class JinaEmbeddingsV5OmniConfig(PretrainedConfig):
59
+ model_type = "jina_embeddings_v5_omni"
60
+
61
+ def __init__(
62
+ self,
63
+ vision_config=None,
64
+ text_config=None,
65
+ audio_config=None,
66
+ task_names=None,
67
+ special_token_ids=None,
68
+ image_token_index=None,
69
+ audio_token_id=None,
70
+ audio_start_token_id=None,
71
+ audio_end_token_id=None,
72
+ projector_hidden_act="gelu",
73
+ tie_word_embeddings=False,
74
+ modality="omni",
75
+ **kwargs,
76
+ ):
77
+ if isinstance(vision_config, dict):
78
+ vision_config = PretrainedConfig(**vision_config)
79
+ self.vision_config = vision_config or PretrainedConfig()
80
+ if isinstance(text_config, dict):
81
+ text_config = PretrainedConfig(**text_config)
82
+ self.text_config = text_config or PretrainedConfig()
83
+ if isinstance(audio_config, dict):
84
+ audio_config = PretrainedConfig(**audio_config)
85
+ self.audio_config = audio_config or PretrainedConfig()
86
+ self.task_names = task_names or TASK_NAMES
87
+ self.special_token_ids = special_token_ids or []
88
+ self.image_token_index = image_token_index
89
+ self.audio_token_id = audio_token_id
90
+ self.audio_start_token_id = audio_start_token_id
91
+ self.audio_end_token_id = audio_end_token_id
92
+ self.projector_hidden_act = projector_hidden_act
93
+ if modality not in _VALID_MODALITIES:
94
+ raise ValueError(f"modality must be one of {_VALID_MODALITIES}, got '{modality}'")
95
+ self.modality = modality
96
+ super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
97
+
98
+ def get_text_config(self, **kwargs):
99
+ return self.text_config
100
+
101
+
102
+ class JinaEmbeddingsV5OmniBase(PreTrainedModel):
103
+ config_class = JinaEmbeddingsV5OmniConfig
104
+ supports_gradient_checkpointing = True
105
+ _supports_sdpa = True
106
+ _supports_flash_attn_2 = True
107
+ _supports_attention_backend = True
108
+ _tied_weights_keys = []
109
+ _keys_to_ignore_on_load_missing = ["lm_head.weight"]
110
+ _keys_to_ignore_on_load_unexpected = []
111
+
112
+ def __init__(self, config: JinaEmbeddingsV5OmniConfig):
113
+ super().__init__(config)
114
+
115
+ modality = getattr(config, "modality", "omni")
116
+ if modality not in _VALID_MODALITIES:
117
+ raise ValueError(f"modality must be one of {_VALID_MODALITIES}, got '{modality}'")
118
+ self._modality = modality
119
+
120
+ vision_cfg = config.vision_config
121
+ if not isinstance(vision_cfg, Qwen3VLVisionConfig):
122
+ d = vision_cfg.to_dict() if hasattr(vision_cfg, "to_dict") else dict(vision_cfg)
123
+ d.pop("model_type", None)
124
+ d.pop("transformers_version", None)
125
+ vision_cfg = Qwen3VLVisionConfig(**d)
126
+ vision_cfg.deepstack_visual_indexes = []
127
+
128
+ spatial_merge_size = getattr(vision_cfg, "spatial_merge_size", 2)
129
+ self._spatial_merge_size = spatial_merge_size
130
+ self._vision_hidden_size = vision_cfg.hidden_size
131
+
132
+ text_cfg = config.text_config
133
+ txt_dict = text_cfg.to_dict() if hasattr(text_cfg, "to_dict") else text_cfg
134
+ if not isinstance(text_cfg, LlamaConfig):
135
+ text_cfg = LlamaConfig(**txt_dict)
136
+ text_hidden = text_cfg.hidden_size
137
+
138
+ if modality not in ("audio", "text"):
139
+ self.vision_tower = Qwen3VLVisionModel(vision_cfg)
140
+ self.vision_tower.merger = nn.Identity()
141
+ self.vision_tower.deepstack_merger_list = nn.ModuleList()
142
+ self.vision_tower.deepstack_visual_indexes = []
143
+ self.mergers = nn.ModuleDict({
144
+ _key(t): PretrainedMerger(vision_cfg.hidden_size, text_hidden, spatial_merge_size)
145
+ for t in config.task_names
146
+ })
147
+
148
+ self.language_model = LlamaModel(text_cfg)
149
+ for layer in self.language_model.layers:
150
+ layer.self_attn.is_causal = False
151
+
152
+ self.multi_modal_projector = nn.Identity()
153
+ self.lm_head = nn.Identity()
154
+
155
+ if modality not in ("vision", "text"):
156
+ aud_cfg = config.audio_config
157
+ aud_dict = aud_cfg.to_dict() if hasattr(aud_cfg, "to_dict") else aud_cfg
158
+ audio_encoder_config = Qwen2_5OmniAudioEncoderConfig(**aud_dict)
159
+ self.audio_tower = Qwen2_5OmniAudioEncoder(audio_encoder_config)
160
+ self.audio_tower.proj = nn.Identity() # fused into audio_projector(s)
161
+ output_dim = aud_dict.get('d_model', 1280) # fused: audio_projector(s) now take d_model
162
+ self.audio_projectors = nn.ModuleDict({
163
+ _key(t): nn.Linear(output_dim, text_hidden) for t in config.task_names
164
+ })
165
+
166
+ ignore = []
167
+ if modality in ("audio", "text"):
168
+ ignore.extend([r"^vision_tower\.", r"^mergers\."])
169
+ if modality in ("vision", "text"):
170
+ ignore.extend([r"^audio_tower\.", r"^audio_projectors\."])
171
+ if ignore:
172
+ self._keys_to_ignore_on_load_unexpected = ignore
173
+
174
+ n_special = len(config.special_token_ids)
175
+ self.task_token_embeddings = nn.ParameterDict({
176
+ _key(t): nn.Parameter(torch.zeros(n_special, text_hidden))
177
+ for t in config.task_names
178
+ })
179
+
180
+ self._active_task_key = _key(config.task_names[0])
181
+ self._special_token_ids = config.special_token_ids
182
+ self.post_init()
183
+
184
+ @property
185
+ def modality(self) -> str:
186
+ return self._modality
187
+
188
+ def set_task(self, task):
189
+ k = _key(task)
190
+ self._active_task_key = k
191
+ with torch.no_grad():
192
+ w = self.language_model.embed_tokens.weight.data
193
+ te = self.task_token_embeddings[k]
194
+ for i, tid in enumerate(self._special_token_ids):
195
+ w[tid] = te[i]
196
+
197
+ def get_input_embeddings(self):
198
+ return self.language_model.embed_tokens
199
+
200
+ def set_input_embeddings(self, value):
201
+ self.language_model.embed_tokens = value
202
+
203
+ def get_output_embeddings(self):
204
+ return None
205
+
206
+ def get_image_features(self, pixel_values, image_grid_thw, num_image_tokens=None):
207
+ if self._modality in ("audio", "text"):
208
+ raise ValueError(
209
+ f"Vision inputs are not available in {self._modality}-only mode. "
210
+ "Load with modality='omni' or modality='vision'."
211
+ )
212
+
213
+ out = self.vision_tower(hidden_states=pixel_values, grid_thw=image_grid_thw)
214
+ raw = out[0] if isinstance(out, tuple) else getattr(out, "last_hidden_state", out[0])
215
+ merged = self.mergers[self._active_task_key](raw)
216
+
217
+ merge = self._spatial_merge_size
218
+ sizes = []
219
+ for i in range(image_grid_thw.shape[0]):
220
+ t, h, w = image_grid_thw[i].tolist()
221
+ sizes.append(int(t) * (int(h) // merge) * (int(w) // merge))
222
+
223
+ # Default: return the un-padded per-image feature slices. Their
224
+ # concatenation has exactly sum(sizes) rows == number of <image>
225
+ # placeholder tokens in input_ids, which is what masked_scatter
226
+ # consumes. Padding is only meaningful when callers want a square
227
+ # [N, max_tok, dim] block (e.g. multi-sample batched forward where
228
+ # each row owns its own image), and that path passes
229
+ # num_image_tokens explicitly to opt in.
230
+ dim = merged.shape[-1]
231
+ features, offset = [], 0
232
+ if num_image_tokens is not None:
233
+ max_tok = num_image_tokens
234
+ for n in sizes:
235
+ feat = merged[offset:offset + n]
236
+ if n < max_tok:
237
+ feat = torch.cat([feat, feat.new_zeros(max_tok - n, dim)], dim=0)
238
+ features.append(feat)
239
+ offset += n
240
+ else:
241
+ for n in sizes:
242
+ features.append(merged[offset:offset + n])
243
+ offset += n
244
+ return features
245
+
246
+ def get_audio_features(self, input_features, feature_attention_mask=None):
247
+ if self._modality in ("vision", "text"):
248
+ raise ValueError(
249
+ f"Audio inputs are not available in {self._modality}-only mode. "
250
+ "Load with modality='omni' or modality='audio'."
251
+ )
252
+
253
+ batch_size = input_features.shape[0]
254
+ if batch_size > 1:
255
+ # Serialize per-sample so the packed-frames GEMM shape stays invariant
256
+ # across batch sizes. Makes batched audio bit-exact to B=1 in bf16,
257
+ # and is substantially faster for B>=16 because B=1 hits a
258
+ # well-optimized kernel while the packed-B=N path thrashes on a
259
+ # (total_frames)^2 sdpa matrix.
260
+ outs = [
261
+ self.get_audio_features(
262
+ input_features[i : i + 1],
263
+ feature_attention_mask[i : i + 1] if feature_attention_mask is not None else None,
264
+ )
265
+ for i in range(batch_size)
266
+ ]
267
+ return torch.cat(outs, dim=0)
268
+ if feature_attention_mask is not None:
269
+ feature_lens = feature_attention_mask.sum(-1).long()
270
+ packed = input_features.permute(0, 2, 1)[feature_attention_mask.bool()].permute(1, 0)
271
+ else:
272
+ feature_lens = torch.full(
273
+ (batch_size,), input_features.shape[2],
274
+ device=input_features.device, dtype=torch.long,
275
+ )
276
+ packed = input_features.transpose(1, 2).reshape(-1, input_features.shape[1]).T
277
+ aftercnn_lens, _ = self.audio_tower._get_feat_extract_output_lengths(feature_lens)
278
+ audio_output = self.audio_tower(
279
+ packed, feature_lens=feature_lens, aftercnn_lens=aftercnn_lens,
280
+ )
281
+ return self.audio_projectors[self._active_task_key](audio_output.last_hidden_state)
282
+
283
+ def forward(
284
+ self,
285
+ input_ids=None,
286
+ pixel_values=None,
287
+ attention_mask=None,
288
+ position_ids=None,
289
+ past_key_values=None,
290
+ inputs_embeds=None,
291
+ input_features=None,
292
+ feature_attention_mask=None,
293
+ cache_position=None,
294
+ output_hidden_states=None,
295
+ **kwargs,
296
+ ):
297
+ image_grid_thw = kwargs.pop("image_grid_thw", None)
298
+ num_image_tokens = kwargs.pop("num_image_tokens", None)
299
+ pixel_values_videos = kwargs.pop("pixel_values_videos", None)
300
+ video_grid_thw = kwargs.pop("video_grid_thw", None)
301
+ num_video_tokens = kwargs.pop("num_video_tokens", None)
302
+ kwargs.pop("spatial_shapes", None)
303
+ kwargs.pop("pixel_attention_mask", None)
304
+
305
+ if pixel_values is not None and self._modality in ("audio", "text"):
306
+ raise ValueError(
307
+ f"Vision inputs are not available in {self._modality}-only mode. "
308
+ "Load with modality='omni' or modality='vision'."
309
+ )
310
+ if input_features is not None and self._modality in ("vision", "text"):
311
+ raise ValueError(
312
+ f"Audio inputs are not available in {self._modality}-only mode. "
313
+ "Load with modality='omni' or modality='audio'."
314
+ )
315
+
316
+ if (input_ids is None) ^ (inputs_embeds is not None):
317
+ raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
318
+
319
+ if inputs_embeds is None:
320
+ inputs_embeds = self.get_input_embeddings()(input_ids)
321
+
322
+ # Image and video both use config.image_token_index (the processor
323
+ # remaps <|video_pad|> to <image>). When a single forward pass mixes
324
+ # both modalities, the mask matches both sets of placeholders, so we
325
+ # build one combined source with image features first then video
326
+ # features, matching the order of placeholders in input_ids.
327
+ all_feats = []
328
+ if pixel_values is not None and image_grid_thw is not None:
329
+ all_feats.extend(self.get_image_features(pixel_values, image_grid_thw, num_image_tokens))
330
+ if pixel_values_videos is not None and video_grid_thw is not None:
331
+ all_feats.extend(self.get_image_features(pixel_values_videos, video_grid_thw, num_video_tokens))
332
+ if all_feats:
333
+ feats = torch.cat(all_feats, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
334
+ mask = (input_ids == self.config.image_token_index).unsqueeze(-1).expand_as(inputs_embeds)
335
+ inputs_embeds = inputs_embeds.masked_scatter(mask, feats)
336
+
337
+ if input_features is not None:
338
+ aud = self.get_audio_features(input_features, feature_attention_mask)
339
+ aud_flat = aud.reshape(-1, aud.shape[-1]).to(inputs_embeds.device, inputs_embeds.dtype)
340
+ mask = (input_ids == self.config.audio_token_id).unsqueeze(-1).expand_as(inputs_embeds)
341
+ inputs_embeds = inputs_embeds.masked_scatter(mask, aud_flat)
342
+
343
+ if attention_mask is not None and attention_mask.dim() == 2:
344
+ dtype = inputs_embeds.dtype
345
+ seq_len = inputs_embeds.shape[1]
346
+ bidi = attention_mask[:, None, None, :].to(dtype=dtype)
347
+ bidi = (1.0 - bidi) * torch.finfo(dtype).min
348
+ attention_mask = bidi.expand(-1, -1, seq_len, -1)
349
+
350
+ out = self.language_model(
351
+ attention_mask=attention_mask,
352
+ position_ids=position_ids,
353
+ past_key_values=past_key_values,
354
+ inputs_embeds=inputs_embeds,
355
+ cache_position=cache_position,
356
+ output_hidden_states=output_hidden_states,
357
+ )
358
+
359
+ return BaseModelOutputWithPast(
360
+ last_hidden_state=self.lm_head(out[0]),
361
+ past_key_values=out.past_key_values,
362
+ hidden_states=out.hidden_states,
363
+ attentions=out.attentions,
364
+ )
365
+
366
+
367
+ class JinaEmbeddingsV5OmniModel(PeftMixedModel):
368
+ config_class = JinaEmbeddingsV5OmniConfig
369
+
370
+ @classmethod
371
+ def register_for_auto_class(cls, auto_class="AutoModel"):
372
+ return PreTrainedModel.register_for_auto_class.__func__(cls, auto_class)
373
+
374
+ @classmethod
375
+ def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
376
+ modality = kwargs.pop("modality", None)
377
+ task_kwarg = kwargs.pop("task", None)
378
+ config = kwargs.pop("config", None)
379
+ if config is None:
380
+ config = JinaEmbeddingsV5OmniConfig.from_pretrained(pretrained_model_name_or_path)
381
+ if modality is not None:
382
+ config.modality = modality
383
+ elif not hasattr(config, "modality") or config.modality is None:
384
+ config.modality = "omni"
385
+
386
+ default_dtype = getattr(config, "torch_dtype", None) or torch.float32
387
+ base_model = JinaEmbeddingsV5OmniBase.from_pretrained(
388
+ pretrained_model_name_or_path,
389
+ config=config,
390
+ torch_dtype=kwargs.pop("torch_dtype", kwargs.pop("dtype", default_dtype)),
391
+ )
392
+
393
+ if os.path.isdir(pretrained_model_name_or_path):
394
+ adapters_dir = os.path.join(pretrained_model_name_or_path, "adapters")
395
+ else:
396
+ cache = snapshot_download(
397
+ repo_id=pretrained_model_name_or_path,
398
+ allow_patterns=["adapters/*"],
399
+ )
400
+ adapters_dir = os.path.join(cache, "adapters")
401
+
402
+ adapter_paths = {
403
+ name: os.path.join(adapters_dir, name) for name in config.task_names
404
+ }
405
+
406
+ peft_config = PeftConfig.from_pretrained(adapter_paths["retrieval"], **kwargs)
407
+ model = cls(base_model, peft_config, adapter_name="retrieval")
408
+ model._pretrained_path = pretrained_model_name_or_path
409
+ for name in config.task_names:
410
+ model.load_adapter(adapter_paths[name], adapter_name=name, **kwargs)
411
+
412
+ model.tokenizer = AutoTokenizer.from_pretrained(
413
+ pretrained_model_name_or_path, trust_remote_code=True,
414
+ )
415
+ # Task precedence: kwarg > config.task (hf_overrides path) > env var > default.
416
+ task = task_kwarg
417
+ if task is None:
418
+ task = getattr(config, "task", None)
419
+ if task is None:
420
+ task = os.environ.get("JINA_V5_TASK")
421
+ if task is None:
422
+ task = config.task_names[0]
423
+ if task not in config.task_names:
424
+ raise ValueError(
425
+ f"task must be one of {config.task_names}, got '{task}'"
426
+ )
427
+ model.set_adapter(task)
428
+ return model
429
+
430
+ @property
431
+ def modality(self) -> str:
432
+ return self.base_model.model.modality
433
+
434
+ def set_adapter(self, adapters):
435
+ super().set_adapter(adapters)
436
+ task = adapters[0] if isinstance(adapters, list) else adapters
437
+ self.base_model.model.set_task(task)
438
+
439
+ def encode(
440
+ self,
441
+ texts: List[str],
442
+ task: str,
443
+ prompt_name: Optional[str] = "document",
444
+ truncate_dim: Optional[int] = None,
445
+ max_length: Optional[int] = None,
446
+ ) -> torch.Tensor:
447
+ cfg = self.base_model.model.config
448
+ if task not in cfg.task_names:
449
+ raise ValueError(f"Unknown task: {task}")
450
+ if prompt_name is None:
451
+ prompt_name = "document"
452
+ if prompt_name not in {"query", "document"}:
453
+ raise ValueError(f"Unknown prompt_name: {prompt_name}")
454
+
455
+ prefix = "Query: " if prompt_name == "query" else "Document: "
456
+ inputs = [f"{prefix}{t}" for t in texts]
457
+
458
+ max_length = max_length or cfg.text_config.max_position_embeddings
459
+ batch = self.tokenizer(
460
+ inputs, return_tensors="pt", padding=True, truncation=True, max_length=max_length,
461
+ )
462
+ device = next(self.parameters()).device
463
+ batch = {k: v.to(device) for k, v in batch.items()}
464
+ self.set_adapter([task])
465
+ self.eval()
466
+ with torch.no_grad():
467
+ hidden = self(**batch).last_hidden_state
468
+ mask = batch.get("attention_mask")
469
+ if mask is None:
470
+ pooled = hidden[:, -1]
471
+ else:
472
+ seq_lens = mask.sum(dim=1) - 1
473
+ pooled = hidden[torch.arange(hidden.shape[0], device=hidden.device), seq_lens]
474
+ if truncate_dim is not None:
475
+ pooled = pooled[:, :truncate_dim]
476
+ return F.normalize(pooled, p=2, dim=-1)
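A sketch of calling `encode()` directly on the AutoModel path, using only arguments defined above (`task`, `prompt_name`, `truncate_dim`); the sample strings and the 256-dim truncation are placeholders:

from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True)
q = model.encode(["what is graphene?"], task="retrieval",
                 prompt_name="query", truncate_dim=256)
d = model.encode(["Graphene is a single layer of carbon atoms."], task="retrieval",
                 prompt_name="document", truncate_dim=256)
score = (q @ d.T).item()  # unit-norm embeddings, so this is cosine similarity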
477
+
478
+ def embed(self, truncate_dim: Optional[int] = None, **inputs):
479
+ """Encode processor outputs into L2-normalized last-token embeddings.
480
+
481
+ Matryoshka: pass `truncate_dim=N` to get an N-dim unit-norm vector
482
+ (truncation is applied before L2-normalization).
483
+ """
484
+ attention_mask = inputs.get("attention_mask", None)
485
+ self.eval()
486
+ with torch.no_grad():
487
+ out = self(**inputs)
488
+ hidden = out.last_hidden_state
489
+ if attention_mask is not None and attention_mask.dim() == 2:
490
+ idx = attention_mask.sum(dim=1) - 1
491
+ else:
492
+ idx = torch.full(
493
+ (hidden.shape[0],), hidden.shape[1] - 1,
494
+ device=hidden.device, dtype=torch.long,
495
+ )
496
+ pooled = hidden[torch.arange(hidden.shape[0], device=hidden.device), idx]
497
+ if truncate_dim is not None:
498
+ pooled = pooled[:, :truncate_dim]
499
+ return torch.nn.functional.normalize(pooled, dim=-1)
500
+
501
+
502
+
503
+ # ---------------------------------------------------------------------------
504
+ # vLLM registration (side-effect on module import).
505
+ #
506
+ # Triggered via config.json "auto_map.AutoConfig" -> this module.
507
+ # HF / sentence-transformers path unaffected: any failure is caught and
508
+ # downgraded to a warning, so pure transformers users never hit a hard vLLM error.
509
+ # ---------------------------------------------------------------------------
510
+
511
+ def _register_vllm() -> None:
512
+ # All vLLM references are resolved via importlib so transformers'
513
+ # static check_imports does NOT flag vllm as a required dependency.
514
+ # Pure-HF / sentence-transformers usage is unaffected.
515
+ #
516
+ # When loaded via transformers' `trust_remote_code=True`, only the
517
+ # modeling_*.py referenced in auto_map is fetched into the
518
+ # transformers_modules cache — sibling vLLM adapter files are NOT.
519
+ # We pull them from HF Hub before registering; otherwise vLLM falls
520
+ # back to its transformers backend (wrong attention semantics) and
521
+ # multi-request batches collapse.
522
+ import importlib.util as _iu
523
+ if _iu.find_spec("vllm") is None:
524
+ return
525
+ try:
526
+ import os
527
+ import sys
528
+ import importlib
529
+ import inspect
530
+ import shutil
531
+
532
+ pkg = __package__ or ""
533
+ current_dir = os.path.dirname(os.path.abspath(__file__))
534
+ siblings = ("vllm_llava_eurobert_audio", "vllm_jina_v5_omni")
535
+
536
+ for sibling_name in siblings:
537
+ sibling_path = os.path.join(current_dir, sibling_name + ".py")
538
+ if os.path.exists(sibling_path):
539
+ continue
540
+ parts = pkg.split(".")
541
+ if len(parts) < 4 or parts[0] != "transformers_modules":
542
+ continue
543
+ from huggingface_hub import hf_hub_download
544
+ repo_name = parts[2].replace("_hyphen_", "-").replace("_dot_", ".")
545
+ repo_id = f"{parts[1]}/{repo_name}"
546
+ downloaded = hf_hub_download(
547
+ repo_id=repo_id,
548
+ filename=sibling_name + ".py",
549
+ revision=parts[3],
550
+ )
551
+ shutil.copy(downloaded, sibling_path)
552
+
553
+ os.environ.setdefault("VLLM_ENABLE_V1_MULTIPROCESSING", "0")
554
+
555
+ _kvc = importlib.import_module("vllm.v1.core.kv_cache_coordinator")
556
+ _orig = _kvc.get_kv_cache_coordinator
557
+ _NoPrefix = _kvc.KVCacheCoordinatorNoPrefixCache
558
+
559
+ _orig_sig = inspect.signature(_orig)
560
+ _noprefix_sig = inspect.signature(_NoPrefix)
561
+
562
+ def _patched(kv_cache_config, max_model_len, *args, **kwargs):
563
+ if len(kv_cache_config.kv_cache_groups) == 0:
564
+ bound = _orig_sig.bind(kv_cache_config, max_model_len, *args, **kwargs)
565
+ return _NoPrefix(**{
566
+ name: bound.arguments[name]
567
+ for name in _noprefix_sig.parameters
568
+ if name in bound.arguments
569
+ })
570
+ return _orig(kv_cache_config, max_model_len, *args, **kwargs)
571
+
572
+ _kvc.get_kv_cache_coordinator = _patched
573
+
574
+ # Make sibling-dir importable from a fresh subprocess too — vLLM's
575
+ # inspect_model_cls runs in a child Python process that doesn't
576
+ # inherit our sys.modules. Without this on PYTHONPATH the
577
+ # string-spec model registration below can't be resolved.
578
+ if current_dir not in sys.path:
579
+ sys.path.insert(0, current_dir)
580
+ existing = os.environ.get("PYTHONPATH", "")
581
+ if current_dir not in existing.split(os.pathsep):
582
+ os.environ["PYTHONPATH"] = (
583
+ current_dir if not existing else current_dir + os.pathsep + existing
584
+ )
585
+
586
+ if pkg:
587
+ _lla = importlib.import_module(".vllm_llava_eurobert_audio", package=pkg)
588
+ _omni = importlib.import_module(".vllm_jina_v5_omni", package=pkg)
589
+ else:
590
+ _lla = importlib.import_module("vllm_llava_eurobert_audio")
591
+ _omni = importlib.import_module("vllm_jina_v5_omni")
592
+ _ = _lla.LlavaEuroBertAudioForVLLMEmbedding # keep reference
593
+
594
+ ModelRegistry = importlib.import_module(
595
+ "vllm.model_executor.models"
596
+ ).ModelRegistry
597
+ # String spec ("module:Class") — survives vLLM's cloudpickle-into-
598
+ # subprocess flow because the child re-imports by name. Passing the
599
+ # class object directly registers __module__ as the qualified
600
+ # transformers_modules.jinaai.<...> path, which the subprocess
601
+ # can't resolve without HF's dynamic-module setup.
602
+ ModelRegistry.register_model(
603
+ "JinaEmbeddingsV5OmniModel",
604
+ "vllm_jina_v5_omni:JinaV5OmniForVLLMEmbedding",
605
+ )
606
+ except Exception as e:
607
+ import warnings
608
+ warnings.warn(
609
+ f"jina-embeddings-v5-omni base: vLLM registration failed "
610
+ f"({type(e).__name__}: {e}); embeddings will fall back to "
611
+ f"vLLM's generic transformers backend (wrong tensor layout).",
612
+ stacklevel=2,
613
+ )
614
+
615
+
616
+ _register_vllm()
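Once the registration side effect above has run, the checkpoint can in principle be served through vLLM's pooling runner. The exact entry points differ across vLLM releases, so treat the following as an assumption-laden sketch rather than a supported recipe; `hf_overrides={'task': ...}` corresponds to the config.task precedence path handled in `from_pretrained` above:

from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v5-omni-nano",
    task="embed",                         # pooling runner; flag name may differ by vLLM version
    trust_remote_code=True,
    hf_overrides={"task": "retrieval"},   # selects the LoRA adapter at load time
)
outs = llm.embed(["Query: what is graphene?"])
vector = outs[0].outputs.embedding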
modeling_llava_eurobert_audio.py ADDED
@@ -0,0 +1,400 @@
1
+ """
2
+ LlavaEuroBertAudioForEmbedding: Qwen3VL vision + Qwen2.5-Omni audio + EuroBERT text.
3
+
4
+ Architecture:
5
+ - Vision: Qwen3VLVisionModel (with RoPE, 3D Conv3d patch embed, all layers)
6
+ - Merger: PretrainedMerger (top-level, NOT inside vision_tower)
7
+ - Audio: Qwen2_5OmniAudioEncoder (Qwen2.5-Omni) + Linear projector
8
+ - Text: LlamaModel (EuroBERT, bidirectional)
9
+ - LM head: Identity (embedding model, no vocab projection)
10
+
11
+ Modality loading:
12
+ model = AutoModel.from_pretrained(path, trust_remote_code=True, modality="omni") # all components (default)
13
+ model = AutoModel.from_pretrained(path, trust_remote_code=True, modality="vision") # no audio tower/projector
14
+ model = AutoModel.from_pretrained(path, trust_remote_code=True, modality="audio") # no vision tower/merger
15
+ """
16
+
17
+ from typing import List, Optional, Union
18
+
19
+ import torch
20
+ import torch.nn as nn
21
+ from transformers import LlamaConfig, PreTrainedModel, PretrainedConfig
22
+ from transformers.modeling_outputs import BaseModelOutputWithPast
23
+ from transformers.models.llama.modeling_llama import LlamaModel
24
+ from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLVisionConfig
25
+ from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLVisionModel
26
+ from transformers.models.qwen2_5_omni.configuration_qwen2_5_omni import Qwen2_5OmniAudioEncoderConfig
27
+ from transformers.models.qwen2_5_omni.modeling_qwen2_5_omni import Qwen2_5OmniAudioEncoder
28
+
29
+
30
+ _VALID_MODALITIES = ("omni", "vision", "audio", "text")
31
+
32
+
33
+ class PretrainedMerger(nn.Module):
34
+ def __init__(self, hidden_size, out_hidden_size, spatial_merge_size=2):
35
+ super().__init__()
36
+ self.hidden_size = hidden_size * (spatial_merge_size**2)
37
+ self.norm = nn.LayerNorm(hidden_size, eps=1e-6)
38
+ self.linear_fc1 = nn.Linear(self.hidden_size, self.hidden_size)
39
+ self.act = nn.GELU()
40
+ self.linear_fc2 = nn.Linear(self.hidden_size, out_hidden_size)
41
+
42
+ def forward(self, x):
43
+ x = self.norm(x)
44
+ x = x.view(-1, self.hidden_size)
45
+ x = self.linear_fc2(self.act(self.linear_fc1(x)))
46
+ return x
47
+
48
+
49
+ class LlavaEuroBertAudioConfig(PretrainedConfig):
50
+ model_type = "llava_eurobert_audio"
51
+
52
+ def __init__(
53
+ self,
54
+ vision_config=None,
55
+ text_config=None,
56
+ audio_config=None,
57
+ image_token_index=None,
58
+ audio_token_id=None,
59
+ audio_start_token_id=None,
60
+ audio_end_token_id=None,
61
+ projector_hidden_act="gelu",
62
+ tie_word_embeddings=False,
63
+ modality="omni",
64
+ **kwargs,
65
+ ):
66
+ if isinstance(vision_config, dict):
67
+ vision_config = PretrainedConfig(**vision_config)
68
+ self.vision_config = vision_config or PretrainedConfig()
69
+ if isinstance(text_config, dict):
70
+ text_config = PretrainedConfig(**text_config)
71
+ self.text_config = text_config or PretrainedConfig()
72
+ if isinstance(audio_config, dict):
73
+ audio_config = PretrainedConfig(**audio_config)
74
+ self.audio_config = audio_config or PretrainedConfig()
75
+ self.image_token_index = image_token_index
76
+ self.audio_token_id = audio_token_id
77
+ self.audio_start_token_id = audio_start_token_id
78
+ self.audio_end_token_id = audio_end_token_id
79
+ self.projector_hidden_act = projector_hidden_act
80
+ if modality not in _VALID_MODALITIES:
81
+ raise ValueError(f"modality must be one of {_VALID_MODALITIES}, got '{modality}'")
82
+ self.modality = modality
83
+ super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
84
+
85
+ def get_text_config(self, **kwargs):
86
+ return self.text_config
87
+
88
+
89
+ class LlavaEuroBertAudioForEmbedding(PreTrainedModel):
90
+ config_class = LlavaEuroBertAudioConfig
91
+ supports_gradient_checkpointing = True
92
+ _supports_sdpa = True
93
+ _supports_flash_attn_2 = True
94
+ _supports_attention_backend = True
95
+ _tied_weights_keys = []
96
+ _keys_to_ignore_on_load_missing = ["lm_head.weight"]
97
+ _keys_to_ignore_on_load_unexpected = []
98
+
99
+ def __init__(self, config: LlavaEuroBertAudioConfig):
100
+ super().__init__(config)
101
+
102
+ modality = getattr(config, "modality", "omni")
103
+ if modality not in _VALID_MODALITIES:
104
+ raise ValueError(f"modality must be one of {_VALID_MODALITIES}, got '{modality}'")
105
+ self._modality = modality
106
+
107
+ vision_cfg = config.vision_config
108
+ if not isinstance(vision_cfg, Qwen3VLVisionConfig):
109
+ if hasattr(vision_cfg, "to_dict"):
110
+ d = vision_cfg.to_dict()
111
+ else:
112
+ d = dict(vision_cfg)
113
+ d.pop("model_type", None)
114
+ d.pop("transformers_version", None)
115
+ vision_cfg = Qwen3VLVisionConfig(**d)
116
+
117
+ vision_cfg.deepstack_visual_indexes = []
118
+ spatial_merge_size = getattr(vision_cfg, "spatial_merge_size", 2)
119
+
120
+ text_cfg = config.text_config
121
+ if not isinstance(text_cfg, LlamaConfig):
122
+ txt_dict = text_cfg.to_dict() if hasattr(text_cfg, 'to_dict') else dict(text_cfg)
123
+ _saved_attn_impl = getattr(text_cfg, "_attn_implementation", None)
124
+ text_cfg = LlamaConfig(**txt_dict)
125
+ if _saved_attn_impl is not None:
126
+ text_cfg._attn_implementation = _saved_attn_impl
127
+ text_hidden = text_cfg.hidden_size
128
+
129
+ self._spatial_merge_size = spatial_merge_size
130
+ self._vision_hidden_size = getattr(vision_cfg, "hidden_size", 768)
131
+
132
+ if modality not in ("audio", "text"):
133
+ self.vision_tower = Qwen3VLVisionModel(vision_cfg)
134
+ self.vision_tower.merger = nn.Identity()
135
+ self.vision_tower.deepstack_merger_list = nn.ModuleList()
136
+ self.vision_tower.deepstack_visual_indexes = []
137
+ self.merger = PretrainedMerger(
138
+ vision_cfg.hidden_size, text_hidden, spatial_merge_size
139
+ )
140
+
141
+ self.multi_modal_projector = nn.Identity()
142
+ self.language_model = LlamaModel(text_cfg)
143
+ self.lm_head = nn.Identity()
144
+
145
+ for layer in self.language_model.layers:
146
+ layer.self_attn.is_causal = False
147
+
148
+ if modality not in ("vision", "text"):
149
+ aud_cfg = config.audio_config
150
+ aud_dict = aud_cfg.to_dict() if hasattr(aud_cfg, 'to_dict') else aud_cfg
151
+ audio_encoder_config = Qwen2_5OmniAudioEncoderConfig(**aud_dict)
152
+ self.audio_tower = Qwen2_5OmniAudioEncoder(audio_encoder_config)
153
+ output_dim = aud_dict.get('output_dim', 3584)
154
+ self.audio_projector = nn.Linear(output_dim, text_hidden)
155
+
156
+ ignore = []
157
+ if modality in ("audio", "text"):
158
+ ignore.extend([r"^vision_tower\.", r"^merger\."])
159
+ if modality in ("vision", "text"):
160
+ ignore.extend([r"^audio_tower\.", r"^audio_projector\."])
161
+ if ignore:
162
+ self._keys_to_ignore_on_load_unexpected = ignore
163
+
164
+ self.post_init()
165
+
166
+ @property
167
+ def modality(self) -> str:
168
+ return self._modality
169
+
170
+ def get_input_embeddings(self):
171
+ return self.language_model.embed_tokens
172
+
173
+ def set_input_embeddings(self, value):
174
+ self.language_model.embed_tokens = value
175
+
176
+ def get_output_embeddings(self):
177
+ return None
178
+
179
+ def get_image_features(
180
+ self,
181
+ pixel_values: torch.FloatTensor,
182
+ image_grid_thw: torch.LongTensor,
183
+ num_image_tokens: Optional[int] = None,
184
+ ) -> List[torch.Tensor]:
185
+ if self._modality in ("audio", "text"):
186
+ raise ValueError(
187
+ f"Vision inputs are not available in {self._modality}-only mode. "
188
+ "Load with modality='omni' or modality='vision'."
189
+ )
190
+
191
+ vision_output = self.vision_tower(
192
+ hidden_states=pixel_values, grid_thw=image_grid_thw
193
+ )
194
+ if isinstance(vision_output, tuple):
195
+ raw_hidden = vision_output[0]
196
+ elif hasattr(vision_output, "pooler_output") and vision_output.pooler_output is not None:
197
+ raw_hidden = vision_output.pooler_output
198
+ else:
199
+ raw_hidden = vision_output[0]
200
+
201
+ image_features = self.merger(raw_hidden)
202
+
203
+ merge_sq = self._spatial_merge_size ** 2
204
+ split_sizes = (image_grid_thw.prod(-1) // merge_sq).tolist()
205
+ return list(torch.split(image_features, split_sizes))
206
+
207
+ def get_audio_features(
208
+ self,
209
+ input_features: torch.FloatTensor,
210
+ feature_attention_mask: Optional[torch.LongTensor] = None,
211
+ ) -> torch.Tensor:
212
+ if self._modality in ("vision", "text"):
213
+ raise ValueError(
214
+ f"Audio inputs are not available in {self._modality}-only mode. "
215
+ "Load with modality='omni' or modality='audio'."
216
+ )
217
+
218
+ batch_size = input_features.shape[0]
219
+ if batch_size > 1:
220
+ # Serialize per-sample so the packed-frames GEMM shape stays invariant
221
+ # across batch sizes. Makes batched audio bit-exact to B=1 in bf16,
222
+ # and is substantially faster for B>=16 because B=1 hits a
223
+ # well-optimized kernel while the packed-B=N path thrashes on a
224
+ # (total_frames)^2 sdpa matrix.
225
+ outs = [
226
+ self.get_audio_features(
227
+ input_features[i : i + 1],
228
+ feature_attention_mask[i : i + 1] if feature_attention_mask is not None else None,
229
+ )
230
+ for i in range(batch_size)
231
+ ]
232
+ return torch.cat(outs, dim=0)
233
+ if feature_attention_mask is not None:
234
+ feature_lens = feature_attention_mask.sum(-1).long()
235
+ packed = input_features.permute(0, 2, 1)[feature_attention_mask.bool()].permute(1, 0)
236
+ else:
237
+ feature_lens = torch.full(
238
+ (batch_size,), input_features.shape[2],
239
+ device=input_features.device, dtype=torch.long,
240
+ )
241
+ packed = input_features.transpose(1, 2).reshape(-1, input_features.shape[1]).T
242
+ aftercnn_lens, _ = self.audio_tower._get_feat_extract_output_lengths(feature_lens)
243
+ audio_output = self.audio_tower(
244
+ packed, feature_lens=feature_lens, aftercnn_lens=aftercnn_lens,
245
+ )
246
+ return self.audio_projector(audio_output.last_hidden_state)
247
+
248
+ def forward(
249
+ self,
250
+ input_ids: Optional[torch.LongTensor] = None,
251
+ pixel_values: Optional[torch.FloatTensor] = None,
252
+ attention_mask: Optional[torch.Tensor] = None,
253
+ position_ids: Optional[torch.LongTensor] = None,
254
+ past_key_values=None,
255
+ inputs_embeds: Optional[torch.FloatTensor] = None,
256
+ input_features: Optional[torch.FloatTensor] = None,
257
+ feature_attention_mask: Optional[torch.LongTensor] = None,
258
+ cache_position: Optional[torch.LongTensor] = None,
259
+ output_hidden_states: Optional[bool] = None,
260
+ **kwargs,
261
+ ):
262
+ image_grid_thw = kwargs.pop("image_grid_thw", None)
263
+ num_image_tokens = kwargs.pop("num_image_tokens", None)
264
+ kwargs.pop("spatial_shapes", None)
265
+ kwargs.pop("pixel_attention_mask", None)
266
+
267
+ if pixel_values is not None and self._modality in ("audio", "text"):
268
+ raise ValueError(
269
+ f"Vision inputs are not available in {self._modality}-only mode. "
270
+ "Load with modality='omni' or modality='vision'."
271
+ )
272
+ if input_features is not None and self._modality in ("vision", "text"):
273
+ raise ValueError(
274
+ f"Audio inputs are not available in {self._modality}-only mode. "
275
+ "Load with modality='omni' or modality='audio'."
276
+ )
277
+
278
+ if (input_ids is None) ^ (inputs_embeds is not None):
279
+ raise ValueError(
280
+ "You must specify exactly one of input_ids or inputs_embeds"
281
+ )
282
+
283
+ if inputs_embeds is None:
284
+ inputs_embeds = self.get_input_embeddings()(input_ids)
285
+
286
+ if pixel_values is not None and image_grid_thw is not None:
287
+ image_features = self.get_image_features(
288
+ pixel_values=pixel_values,
289
+ image_grid_thw=image_grid_thw,
290
+ num_image_tokens=num_image_tokens,
291
+ )
292
+ image_features = torch.cat(image_features, dim=0).to(
293
+ inputs_embeds.device, inputs_embeds.dtype
294
+ )
295
+ special_image_mask = (
296
+ (input_ids == self.config.image_token_index)
297
+ .unsqueeze(-1)
298
+ .expand_as(inputs_embeds)
299
+ )
300
+ inputs_embeds = inputs_embeds.masked_scatter(
301
+ special_image_mask, image_features
302
+ )
303
+
304
+ if input_features is not None:
305
+ audio_embeds = self.get_audio_features(
306
+ input_features, feature_attention_mask
307
+ )
308
+ audio_embeds_flat = audio_embeds.reshape(
309
+ -1, audio_embeds.shape[-1]
310
+ ).to(inputs_embeds.device, inputs_embeds.dtype)
311
+ audio_mask = (
312
+ (input_ids == self.config.audio_token_id)
313
+ .unsqueeze(-1)
314
+ .expand_as(inputs_embeds)
315
+ )
316
+ inputs_embeds = inputs_embeds.masked_scatter(
317
+ audio_mask, audio_embeds_flat
318
+ )
319
+
320
+ if attention_mask is not None and attention_mask.dim() == 2:
321
+ dtype = inputs_embeds.dtype
322
+ seq_len = inputs_embeds.shape[1]
323
+ bidi_mask = attention_mask[:, None, None, :].to(dtype=dtype)
324
+ bidi_mask = (1.0 - bidi_mask) * torch.finfo(dtype).min
325
+ attention_mask = bidi_mask.expand(-1, -1, seq_len, -1)
326
+
327
+ # vLLM's transformers backend passes `return_dict=False` + `attention_instances`.
328
+ # Force dict-style output internally, and forward remaining kwargs so the
329
+ # vllm attention hook receives its `attention_instances` dict.
330
+ kwargs.pop("return_dict", None)
331
+ outputs = self.language_model(
332
+ attention_mask=attention_mask,
333
+ position_ids=position_ids,
334
+ past_key_values=past_key_values,
335
+ inputs_embeds=inputs_embeds,
336
+ cache_position=cache_position,
337
+ output_hidden_states=output_hidden_states,
338
+ return_dict=True,
339
+ **kwargs,
340
+ )
341
+
342
+ hidden_states = outputs[0]
343
+ logits = self.lm_head(hidden_states)
344
+
345
+ return BaseModelOutputWithPast(
346
+ last_hidden_state=logits,
347
+ past_key_values=outputs.past_key_values,
348
+ hidden_states=outputs.hidden_states,
349
+ attentions=outputs.attentions,
350
+ )
351
+
352
+
353
+ def _register_vllm() -> None:
354
+ import importlib.util as _iu
355
+ if _iu.find_spec("vllm") is None:
356
+ return
357
+ try:
358
+ import os, sys, importlib, shutil
359
+ pkg = __package__ or ""
360
+ current_dir = os.path.dirname(os.path.abspath(__file__))
361
+ sibling_name = "vllm_llava_eurobert_audio"
362
+ sibling_path = os.path.join(current_dir, sibling_name + ".py")
363
+ if not os.path.exists(sibling_path):
364
+ parts = pkg.split(".")
365
+ if len(parts) >= 4 and parts[0] == "transformers_modules":
366
+ from huggingface_hub import hf_hub_download
367
+ repo_name = parts[2].replace("_hyphen_", "-").replace("_dot_", ".")
368
+ repo_id = f"{parts[1]}/{repo_name}"
369
+ downloaded = hf_hub_download(
370
+ repo_id=repo_id,
371
+ filename=sibling_name + ".py",
372
+ revision=parts[3],
373
+ )
374
+ shutil.copy(downloaded, sibling_path)
375
+ if current_dir not in sys.path:
376
+ sys.path.insert(0, current_dir)
377
+ existing = os.environ.get("PYTHONPATH", "")
378
+ if current_dir not in existing.split(os.pathsep):
379
+ os.environ["PYTHONPATH"] = (
380
+ current_dir if not existing else current_dir + os.pathsep + existing
381
+ )
382
+ if pkg:
383
+ _lla = importlib.import_module("." + sibling_name, package=pkg)
384
+ else:
385
+ _lla = importlib.import_module(sibling_name)
386
+ from vllm import ModelRegistry
387
+ ModelRegistry.register_model(
388
+ "LlavaEuroBertAudioForEmbedding",
389
+ _lla.LlavaEuroBertAudioForVLLMEmbedding,
390
+ )
391
+ except Exception as e:
392
+ import warnings
393
+ warnings.warn(
394
+ f"jina-embeddings-v5-omni nano: vLLM registration failed "
395
+ f"({type(e).__name__}: {e}); falling back to Transformers backend.",
396
+ stacklevel=2,
397
+ )
398
+
399
+
400
+ _register_vllm()
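For reference, two steps of the forward pass above can be illustrated in isolation: media features are spliced into the token embeddings with `masked_scatter`, and the 2D padding mask is expanded into a 4D additive bidirectional mask. The following is a minimal sketch; the token id and tensor sizes are invented for the example and are not the model's real configuration values.

```python
# Minimal, self-contained sketch (illustrative only, not part of the checkpoint):
# (1) masked_scatter fills placeholder positions, in order, with media feature rows;
# (2) a 2D padding mask becomes a 4D additive bidirectional mask (no causal
#     triangle, large negative values on padded key positions).
import torch

IMAGE_TOKEN_ID = 3          # hypothetical id standing in for config.image_token_index
batch, seq_len, hidden = 1, 6, 8

input_ids = torch.tensor([[0, 1, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 2, 5]])
inputs_embeds = torch.randn(batch, seq_len, hidden)
image_features = torch.randn(2, hidden)  # one row per placeholder position

mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(mask, image_features)

attention_mask = torch.tensor([[1, 1, 1, 1, 1, 0]])  # last position is padding
dtype = inputs_embeds.dtype
bidi = attention_mask[:, None, None, :].to(dtype)
bidi = (1.0 - bidi) * torch.finfo(dtype).min
bidi = bidi.expand(-1, -1, seq_len, -1)  # (batch, 1, seq_len, seq_len)

print(inputs_embeds.shape, bidi.shape)   # [1, 6, 8] and [1, 1, 6, 6]
```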
modules.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "transformer",
5
+ "path": "",
6
+ "type": "custom_st.Transformer"
7
+ }
8
+ ]
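`modules.json` registers a single custom sentence-transformers module (`custom_st.Transformer`), so the repository can presumably also be loaded through the sentence-transformers API. A hedged, text-only sketch; whether extra encode arguments are needed depends on the custom module:

```python
# Hedged sketch: loading via sentence-transformers through the custom module
# declared in modules.json. trust_remote_code is required so that
# custom_st.Transformer can be imported from the repository.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-nano",
    trust_remote_code=True,
)
embeddings = model.encode(["a photo of a cat", "ein Foto einer Katze"])
print(embeddings.shape)
```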
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_normalize": true,
4
+ "do_rescale": true,
5
+ "do_resize": true,
6
+ "image_mean": [
7
+ 0.5,
8
+ 0.5,
9
+ 0.5
10
+ ],
11
+ "image_processor_type": "Qwen2VLImageProcessor",
12
+ "image_std": [
13
+ 0.5,
14
+ 0.5,
15
+ 0.5
16
+ ],
17
+ "merge_size": 2,
18
+ "patch_size": 16,
19
+ "resample": 3,
20
+ "rescale_factor": 0.00392156862745098,
21
+ "temporal_patch_size": 2,
22
+ "min_pixels": 262144,
23
+ "max_pixels": 1310720
24
+ }
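The `min_pixels`/`max_pixels` bounds, together with `patch_size: 16` and `merge_size: 2`, roughly determine how many `<image>` tokens a single image can produce: each 16x16 pixel patch is one vision patch, and a 2x2 block of patches merges into one token. A back-of-the-envelope check (actual counts depend on how the image is resized into the pixel budget):

```python
# Approximate per-image token counts implied by this preprocessor config.
patch_size, merge_size = 16, 2
min_pixels, max_pixels = 262_144, 1_310_720

def approx_image_tokens(num_pixels: int) -> int:
    patches = num_pixels // (patch_size * patch_size)
    return patches // (merge_size * merge_size)

print(approx_image_tokens(min_pixels))  # 256 tokens for the smallest pixel budget
print(approx_image_tokens(max_pixels))  # 1280 tokens for the largest pixel budget
```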
processing_llava_eurobert.py ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Custom processor for jina-embeddings-v5-omni-nano.
2
+
3
+ Keeps Qwen2VL image/video preprocessing (pixel_values, pixel_values_videos,
4
+ image_grid_thw, video_grid_thw) but maps both media placeholders to nano's
5
+ <image> token instead of Qwen's <|image_pad|> / <|video_pad|>.
6
+
7
+ Qwen2VLProcessor expands self.image_token once per image_grid_thw entry and
8
+ self.video_token once per video_grid_thw entry. Overriding both to "<image>"
9
+ makes the super() call expand either modality into N consecutive <image>
10
+ tokens where N = prod(grid_thw) // merge_size**2.
11
+ """
12
+
13
+ from transformers.models.qwen2_vl.processing_qwen2_vl import (
14
+ Qwen2VLProcessor,
15
+ )
16
+
17
+
18
+ class LlavaEuroBertProcessor(Qwen2VLProcessor):
19
+
20
+ def __init__(
21
+ self,
22
+ image_processor=None,
23
+ tokenizer=None,
24
+ video_processor=None,
25
+ chat_template=None,
26
+ **kwargs,
27
+ ):
28
+ super().__init__(
29
+ image_processor=image_processor,
30
+ tokenizer=tokenizer,
31
+ video_processor=video_processor,
32
+ chat_template=chat_template,
33
+ **kwargs,
34
+ )
35
+ self.image_token = "<image>"
36
+ self.image_token_id = tokenizer.convert_tokens_to_ids(
37
+ self.image_token
38
+ )
39
+ self.video_token = "<image>"
40
+ self.video_token_id = self.image_token_id
41
+
42
+ def __call__(
43
+ self, images=None, text=None, videos=None, **kwargs
44
+ ):
45
+ if text is not None:
46
+ if isinstance(text, str):
47
+ text = [text]
48
+ text = [
49
+ t.replace(
50
+ "<|vision_start|><|image_pad|><|vision_end|>",
51
+ "<image>",
52
+ )
53
+ .replace(
54
+ "<|vision_start|><|video_pad|><|vision_end|>",
55
+ "<image>",
56
+ )
57
+ .replace("<|image_pad|>", "<image>")
58
+ .replace("<|video_pad|>", "<image>")
59
+ .replace("<|vision_start|>", "")
60
+ .replace("<|vision_end|>", "")
61
+ for t in text
62
+ ]
63
+ return super().__call__(
64
+ images=images, text=text, videos=videos, **kwargs
65
+ )
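As the docstring above explains, a single `<image>` placeholder (or a Qwen-style `<|vision_start|><|image_pad|><|vision_end|>` span) is expanded into `prod(grid_thw) // merge_size**2` consecutive `<image>` tokens. A hedged usage sketch: the repository id resolves to this class through the `auto_map` entry in `processor_config.json`, and the image file is a placeholder.

```python
# Hedged sketch of calling the processor above through AutoProcessor.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "jinaai/jina-embeddings-v5-omni-nano", trust_remote_code=True
)
image = Image.open("example.jpg")  # hypothetical local file

batch = processor(images=image, text="<image> a short caption", return_tensors="pt")
t, h, w = batch["image_grid_thw"][0].tolist()
merge = processor.image_processor.merge_size
# Number of <image> tokens the single placeholder was expanded into:
print(t * (h // merge) * (w // merge))
```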
processor_config.json ADDED
@@ -0,0 +1,68 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "processor_class": "LlavaEuroBertProcessor",
3
+ "auto_map": {
4
+ "AutoProcessor": "processing_llava_eurobert.LlavaEuroBertProcessor"
5
+ },
6
+ "image_processor": {
7
+ "image_processor_type": "Qwen2VLImageProcessorFast",
8
+ "do_convert_rgb": true,
9
+ "do_normalize": true,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "image_mean": [
13
+ 0.5,
14
+ 0.5,
15
+ 0.5
16
+ ],
17
+ "image_std": [
18
+ 0.5,
19
+ 0.5,
20
+ 0.5
21
+ ],
22
+ "min_pixels": 262144,
23
+ "max_pixels": 1310720,
24
+ "size": {
25
+ "longest_edge": 16777216,
26
+ "shortest_edge": 65536
27
+ },
28
+ "merge_size": 2,
29
+ "patch_size": 16,
30
+ "resample": 3,
31
+ "rescale_factor": 0.00392156862745098,
32
+ "temporal_patch_size": 2
33
+ },
34
+ "video_processor": {
35
+ "video_processor_type": "Qwen3VLVideoProcessor",
36
+ "do_convert_rgb": true,
37
+ "do_normalize": true,
38
+ "do_rescale": true,
39
+ "do_resize": true,
40
+ "do_sample_frames": true,
41
+ "fps": 2,
42
+ "image_mean": [
43
+ 0.5,
44
+ 0.5,
45
+ 0.5
46
+ ],
47
+ "image_std": [
48
+ 0.5,
49
+ 0.5,
50
+ 0.5
51
+ ],
52
+ "max_frames": 768,
53
+ "min_frames": 4,
54
+ "merge_size": 2,
55
+ "patch_size": 16,
56
+ "resample": 3,
57
+ "rescale_factor": 0.00392156862745098,
58
+ "size": {
59
+ "longest_edge": 25165824,
60
+ "shortest_edge": 4096
61
+ },
62
+ "temporal_patch_size": 2
63
+ },
64
+ "image_token": "<image>",
65
+ "num_additional_image_tokens": 0,
66
+ "patch_size": null,
67
+ "vision_feature_select_strategy": null
68
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b8135ff0f019acbce7c4b93d1fb24cb8325b5fb0d76c57bbde8101a73cba7fa9
3
+ size 17211089
tokenizer_config.json ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "backend": "tokenizers",
3
+ "bos_token": "<|begin_of_text|>",
4
+ "clean_up_tokenization_spaces": true,
5
+ "eos_token": "<|end_of_text|>",
6
+ "is_local": false,
7
+ "mask_token": "<|mask|>",
8
+ "max_length": null,
9
+ "model_input_names": [
10
+ "input_ids",
11
+ "attention_mask"
12
+ ],
13
+ "model_max_length": 1000000000000000019884624838656,
14
+ "pad_to_multiple_of": null,
15
+ "pad_token": "<|pad|>",
16
+ "pad_token_type_id": 0,
17
+ "padding_side": "right",
18
+ "processor_class": "LlavaEuroBertProcessor",
19
+ "tokenizer_class": "TokenizersBackend",
20
+ "auto_map": {
21
+ "AutoProcessor": "processing_llava_eurobert.LlavaEuroBertProcessor"
22
+ }
23
+ }
video_preprocessor_config.json ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "do_sample_frames": false,
3
+ "fps": 2,
4
+ "min_frames": 4,
5
+ "max_frames": 32,
6
+ "size": {
7
+ "longest_edge": 12845056,
8
+ "shortest_edge": 262144
9
+ },
10
+ "patch_size": 16,
11
+ "merge_size": 2,
12
+ "temporal_patch_size": 2,
13
+ "do_convert_rgb": true,
14
+ "do_normalize": true,
15
+ "do_rescale": true,
16
+ "do_resize": true,
17
+ "image_mean": [
18
+ 0.5,
19
+ 0.5,
20
+ 0.5
21
+ ],
22
+ "image_std": [
23
+ 0.5,
24
+ 0.5,
25
+ 0.5
26
+ ],
27
+ "rescale_factor": 0.00392156862745098,
28
+ "resample": 3
29
+ }
vllm_jina_v5_omni.py ADDED
@@ -0,0 +1,175 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """vLLM implementation for jina-embeddings-v5-omni-nano / -small base models.
2
+
3
+ The base models expose:
4
+ - shared LM + vision + audio weights,
5
+ - a per-task LoRA adapter (in adapters/{task}/adapter_model.safetensors),
6
+ - a per-task PretrainedMerger (vision projection),
7
+ - a per-task audio_projector,
8
+ - per-task extra token embeddings applied to language_model.embed_tokens.
9
+
10
+ vLLM requires a concrete static model at load time, so we resolve the task from
11
+ hf_overrides={'task': ...} or the JINA_V5_TASK environment variable (no implicit default). At load_weights time
12
+ we read the base safetensors + the selected adapter, merge LoRA into Q/K/V/O and
13
+ gate/up/down projections, rename the task-specific mergers/projectors/token
14
+ embeddings to their singular form, and stream the resulting state dict into the
15
+ existing LlavaEuroBertAudioForVLLMEmbedding weight loader — producing a forward
16
+ that is identical to the jinaai/jina-embeddings-v5-omni-nano-{task} variant.
17
+
18
+ One task per vLLM instance; spawn separate servers for multi-task serving.
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import json
23
+ import os
24
+ from pathlib import Path
25
+ from typing import Iterable
26
+
27
+ import torch
28
+ from safetensors import safe_open
29
+
30
+ try:
31
+ # Package import — works when HF dynamic-module-loader places this
32
+ # under transformers_modules.<...>.
33
+ from .vllm_llava_eurobert_audio import LlavaEuroBertAudioForVLLMEmbedding
34
+ except ImportError:
35
+ # Top-level import — works when this dir was added to PYTHONPATH
36
+ # (e.g. by vLLM's spawn child during inspect_model_cls).
37
+ from vllm_llava_eurobert_audio import LlavaEuroBertAudioForVLLMEmbedding
38
+
39
+
40
+ _TASK_KEY_MAP = {
41
+ "retrieval": "retrieval",
42
+ "text-matching": "text_matching",
43
+ "clustering": "clustering",
44
+ "classification": "classification",
45
+ }
46
+ _ATTN_MODULES = {"q_proj", "k_proj", "v_proj", "o_proj"}
47
+ _MLP_MODULES = {"gate_proj", "up_proj", "down_proj"}
48
+
49
+
50
+ def _resolve_local_dir(model_path: str) -> Path:
51
+ if os.path.isdir(model_path):
52
+ return Path(model_path)
53
+ from huggingface_hub import snapshot_download
54
+ return Path(snapshot_download(
55
+ repo_id=model_path,
56
+ allow_patterns=["model.safetensors", "config.json", "adapters/*"],
57
+ ))
58
+
59
+
60
+ def _lora_target_key(layer_idx: int, module: str, side: str) -> str:
61
+ parent = "self_attn" if module in _ATTN_MODULES else "mlp"
62
+ return (
63
+ f"base_model.model.language_model.layers.{layer_idx}."
64
+ f"{parent}.{module}.lora_{side}.weight"
65
+ )
66
+
67
+
68
+ def _materialize_task(base_dir: Path, task: str) -> dict[str, torch.Tensor]:
69
+ task_key = _TASK_KEY_MAP[task]
70
+ lora_dir = base_dir / "adapters" / task
71
+
72
+ base_cfg = json.loads((base_dir / "config.json").read_text())
73
+ special_tokens: list[int] = base_cfg["special_token_ids"]
74
+ adapter_cfg = json.loads((lora_dir / "adapter_config.json").read_text())
75
+ scale = adapter_cfg["lora_alpha"] / adapter_cfg["r"]
76
+
77
+ with safe_open(str(base_dir / "model.safetensors"), framework="pt") as f:
78
+ base = {k: f.get_tensor(k) for k in f.keys()}
79
+ with safe_open(str(lora_dir / "adapter_model.safetensors"), framework="pt") as f:
80
+ adapter = {k: f.get_tensor(k) for k in f.keys()}
81
+
82
+ merged: dict[str, torch.Tensor] = {}
83
+
84
+ for key, tensor in base.items():
85
+ if key.startswith("language_model.layers."):
86
+ parts = key.split(".")
87
+ # language_model.layers.{i}.{self_attn|mlp}.{module}.weight
88
+ if len(parts) == 6 and parts[-1] == "weight":
89
+ layer_idx = int(parts[2])
90
+ parent = parts[3]
91
+ module = parts[4]
92
+ if (parent == "self_attn" and module in _ATTN_MODULES) or (
93
+ parent == "mlp" and module in _MLP_MODULES
94
+ ):
95
+ ak = _lora_target_key(layer_idx, module, "A")
96
+ bk = _lora_target_key(layer_idx, module, "B")
97
+ a = adapter.get(ak)
98
+ b = adapter.get(bk)
99
+ if a is not None and b is not None:
100
+ delta = (b.to(torch.float32) @ a.to(torch.float32)) * scale
101
+ tensor = (tensor.to(torch.float32) + delta).to(tensor.dtype)
102
+ merged[key] = tensor
103
+
104
+ elif key == "language_model.embed_tokens.weight":
105
+ tensor = tensor.clone()
106
+ te_key = f"task_token_embeddings.{task_key}"
107
+ te = base.get(te_key)
108
+ if te is not None:
109
+ for i, tid in enumerate(special_tokens):
110
+ tensor[tid] = te[i].to(tensor.dtype)
111
+ merged[key] = tensor
112
+
113
+ elif key.startswith("mergers."):
114
+ prefix = f"mergers.{task_key}."
115
+ if key.startswith(prefix):
116
+ merged["merger." + key[len(prefix):]] = tensor
117
+
118
+ elif key.startswith("audio_projectors."):
119
+ prefix = f"audio_projectors.{task_key}."
120
+ if key.startswith(prefix):
121
+ merged["audio_projector." + key[len(prefix):]] = tensor
122
+
123
+ elif key.startswith("task_token_embeddings."):
124
+ # Consumed into embed_tokens above.
125
+ pass
126
+
127
+ else:
128
+ merged[key] = tensor
129
+
130
+ return merged
131
+
132
+
133
+ class JinaV5OmniForVLLMEmbedding(LlavaEuroBertAudioForVLLMEmbedding):
134
+ """vLLM wrapper for the base jina-embeddings-v5-omni-{nano,small}.
135
+
136
+ Reads JINA_V5_TASK env var; merges base + adapter[task] + task components at
137
+ load time. Resulting forward equals the jinaai/jina-embeddings-v5-omni-*-{task}
138
+ task variant.
139
+ """
140
+
141
+ def __init__(self, *, vllm_config, prefix: str = ""):
142
+ super().__init__(vllm_config=vllm_config, prefix=prefix)
143
+ model = getattr(vllm_config.model_config, "model", None)
144
+ if not isinstance(model, str):
145
+ raise RuntimeError(
146
+ "JinaV5OmniForVLLMEmbedding requires a string model path; got "
147
+ f"{type(model).__name__}"
148
+ )
149
+ self._base_dir = _resolve_local_dir(model)
150
+
151
+ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
152
+ # Task precedence: config.task (hf_overrides) > env var. No silent
153
+ # fallback: defaulting to a task could make the base model embed
154
+ # with an adapter the caller never chose.
155
+ task = getattr(self.config, "task", None)
156
+ if task is None:
157
+ task = os.environ.get("JINA_V5_TASK")
158
+ if task is None:
159
+ raise ValueError(
160
+ "JinaV5OmniForVLLMEmbedding requires a task selection. Pass "
161
+ "hf_overrides={'task': X} to LLM(...) or set JINA_V5_TASK=X "
162
+ "in the environment, where X is one of "
163
+ f"{sorted(_TASK_KEY_MAP)}."
164
+ )
165
+ if task not in _TASK_KEY_MAP:
166
+ raise ValueError(
167
+ f"task must be one of {sorted(_TASK_KEY_MAP)}, got '{task}'"
168
+ )
169
+ # The streamed `weights` arg only covers base model.safetensors; we need
170
+ # the adapters too, so we materialize from disk directly and discard the
171
+ # incoming stream.
172
+ for _ in weights:
173
+ pass
174
+ materialized = _materialize_task(self._base_dir, task)
175
+ return super().load_weights(iter(materialized.items()))
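A hedged sketch of embedding with vLLM through the wrapper above, selecting the task via `hf_overrides` (which takes precedence over `JINA_V5_TASK`). Constructor flags other than `model`, `trust_remote_code`, and `hf_overrides` differ between vLLM versions, so treat the exact arguments as assumptions:

```python
# Hedged sketch: one task per LLM instance, as noted in the docstring above.
from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v5-omni-nano",
    runner="pooling",                    # older vLLM versions: task="embed"
    trust_remote_code=True,
    hf_overrides={"task": "retrieval"},  # or classification / clustering / text-matching
)
outputs = llm.embed(["A query about pandas", "A passage about bears"])
print(len(outputs[0].outputs.embedding))

# Equivalent task selection for `vllm serve` via the environment variable:
#   JINA_V5_TASK=retrieval vllm serve jinaai/jina-embeddings-v5-omni-nano --trust-remote-code
```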
vllm_llava_eurobert_audio.py ADDED
@@ -0,0 +1,889 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ vLLM model implementation for LlavaEuroBertAudioForEmbedding (nano multimodal embedding).
3
+
4
+ Combines:
5
+ - Vision: Qwen3VL vision encoder + PretrainedMerger
6
+ - Audio: Qwen2_5OmniAudioEncoder (from Qwen2.5-Omni-7B) + Linear projector
7
+ - Text: LlamaModel / EuroBERT (bidirectional)
8
+
9
+ Usage:
10
+ from vllm import ModelRegistry
11
+ ModelRegistry.register_model(
12
+ "LlavaEuroBertAudioForEmbedding",
13
+ "vllm_llava_eurobert_audio:LlavaEuroBertAudioForVLLMEmbedding",
14
+ )
15
+
16
+ vllm serve /path/to/model --task embedding --trust-remote-code
17
+ """
18
+
19
+ import os
20
+ from collections.abc import Iterable, Mapping, Sequence
21
+ from pathlib import Path
22
+ from typing import Annotated, Any, Literal, TypeAlias
23
+
24
+ import numpy as np
25
+ import torch
26
+ import torch.nn as nn
27
+ from transformers import BatchFeature
28
+ from transformers.models.qwen2_5_omni.configuration_qwen2_5_omni import (
29
+ Qwen2_5OmniAudioEncoderConfig,
30
+ )
31
+ from transformers.models.qwen2_5_omni.modeling_qwen2_5_omni import Qwen2_5OmniAudioEncoder
32
+ from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLVisionConfig
33
+ from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLVisionModel
34
+ from transformers.models.whisper import WhisperFeatureExtractor
35
+
36
+ from vllm.config import VllmConfig
37
+ try:
38
+ # vllm >= 0.11
39
+ from vllm.config.multimodal import BaseDummyOptions
40
+ except ImportError:
41
+ # vllm < 0.11 — BaseDummyOptions didn't exist; use a lightweight stand-in
42
+ # so the signature annotation still parses.
43
+ class BaseDummyOptions: # type: ignore[no-redef]
44
+ pass
45
+ try:
46
+ # vllm < 0.11: types re-exported via vllm.inputs
47
+ from vllm.inputs import MultiModalDataDict, ModalityData
48
+ except ImportError:
49
+ # vllm >= 0.11: moved to vllm.multimodal.inputs
50
+ from vllm.multimodal.inputs import MultiModalDataDict, ModalityData
51
+ from vllm.multimodal import MULTIMODAL_REGISTRY
52
+ from vllm.multimodal.inputs import (
53
+ AudioItem,
54
+ MultiModalFieldConfig,
55
+ MultiModalKwargsItems,
56
+ )
57
+ from vllm.multimodal.parse import (
58
+ DictEmbeddingItems,
59
+ ModalityDataItems,
60
+ MultiModalDataItems,
61
+ MultiModalDataParser,
62
+ )
63
+ try:
64
+ # vllm < 0.11
65
+ from vllm.multimodal.processing import BaseDummyInputsBuilder
66
+ except ImportError:
67
+ # vllm >= 0.11: moved to vllm.multimodal.profiling
68
+ from vllm.multimodal.profiling import BaseDummyInputsBuilder
69
+ from vllm.multimodal.processing import (
70
+ BaseMultiModalProcessor,
71
+ BaseProcessingInfo,
72
+ PromptReplacement,
73
+ PromptUpdate,
74
+ )
75
+ from vllm.sequence import IntermediateTensors
76
+ from vllm.utils.tensor_schema import TensorSchema, TensorShape
77
+ from vllm.model_executor.models.interfaces import (
78
+ MultiModalEmbeddings,
79
+ SupportsMultiModal,
80
+ SupportsPP,
81
+ )
82
+ from vllm.model_executor.models.qwen2_vl import _create_qwen2vl_field_factory
83
+ from vllm.model_executor.models.utils import (
84
+ AutoWeightsLoader,
85
+ init_vllm_registered_model,
86
+ maybe_prefix,
87
+ )
88
+
89
+
90
+ # --------------------------------------------------------------------------- #
91
+ # PretrainedMerger (same architecture as HuggingFace version)
92
+ # --------------------------------------------------------------------------- #
93
+
94
+
95
+ class PretrainedMerger(nn.Module):
96
+ def __init__(self, hidden_size, out_hidden_size, spatial_merge_size=2):
97
+ super().__init__()
98
+ self.hidden_size = hidden_size * (spatial_merge_size ** 2)
99
+ self.norm = nn.LayerNorm(hidden_size, eps=1e-6)
100
+ self.linear_fc1 = nn.Linear(self.hidden_size, self.hidden_size)
101
+ self.act = nn.GELU()
102
+ self.linear_fc2 = nn.Linear(self.hidden_size, out_hidden_size)
103
+
104
+ def forward(self, x):
105
+ x = self.norm(x)
106
+ x = x.view(-1, self.hidden_size)
107
+ x = self.linear_fc2(self.act(self.linear_fc1(x)))
108
+ return x
109
+
110
+
111
+ # --------------------------------------------------------------------------- #
112
+ # Audio input schemas
113
+ # --------------------------------------------------------------------------- #
114
+
115
+
116
+ class NanoAudioFeatureInputs(TensorSchema):
117
+ type: Literal["audio_features"]
118
+ input_features: Annotated[
119
+ torch.Tensor | list[torch.Tensor],
120
+ TensorShape("na", "nmb", 3000),
121
+ ]
122
+ feature_attention_mask: Annotated[
123
+ torch.Tensor,
124
+ TensorShape("na", 3000),
125
+ ]
126
+
127
+
128
+ class NanoAudioEmbeddingInputs(TensorSchema):
129
+ type: Literal["audio_embeds"] = "audio_embeds"
130
+ audio_embeds: Annotated[
131
+ list[torch.Tensor],
132
+ TensorShape("bn", "naf", "hs", dynamic_dims={"naf"}),
133
+ ]
134
+
135
+
136
+ NanoAudioInputs: TypeAlias = NanoAudioFeatureInputs | NanoAudioEmbeddingInputs
137
+
138
+
139
+ def _get_feat_extract_output_lengths(input_lengths: torch.Tensor):
140
+ feat_lengths = (input_lengths - 1) // 2 + 1
141
+ output_lengths = (feat_lengths - 2) // 2 + 1
142
+ return feat_lengths, output_lengths
143
+
144
+
145
+ # --------------------------------------------------------------------------- #
146
+ # Processing info
147
+ # --------------------------------------------------------------------------- #
148
+
149
+
150
+ class NanoMMAudioMultiModalDataParser(MultiModalDataParser):
151
+ def __init__(self, target_sr, target_channels, expected_hidden_size=None):
152
+ super().__init__(
153
+ target_sr=target_sr,
154
+ target_channels=target_channels,
155
+ expected_hidden_size=expected_hidden_size,
156
+ )
157
+
158
+ def _parse_audio_data(
159
+ self,
160
+ data: dict[str, torch.Tensor] | ModalityData[AudioItem],
161
+ ) -> ModalityDataItems[Any, Any] | None:
162
+ if isinstance(data, dict):
163
+ return DictEmbeddingItems(
164
+ data,
165
+ modality="audio",
166
+ required_fields={"audio_embeds"},
167
+ fields_factory=lambda hf: dict(
168
+ audio_embeds=MultiModalFieldConfig.batched("audio"),
169
+ input_features=MultiModalFieldConfig.batched("audio"),
170
+ feature_attention_mask=MultiModalFieldConfig.batched("audio"),
171
+ ),
172
+ )
173
+ return super()._parse_audio_data(data)
174
+
175
+
176
+ class NanoMMProcessingInfo(BaseProcessingInfo):
177
+ def get_hf_config(self):
178
+ return self.ctx.get_hf_config()
179
+
180
+ def get_feature_extractor(self, **kwargs) -> WhisperFeatureExtractor:
181
+ return WhisperFeatureExtractor(feature_size=128)
182
+
183
+ def get_data_parser(self):
184
+ feature_extractor = self.get_feature_extractor()
185
+ return NanoMMAudioMultiModalDataParser(
186
+ target_sr=feature_extractor.sampling_rate,
187
+ target_channels=1,
188
+ expected_hidden_size=self._get_expected_hidden_size(),
189
+ )
190
+
191
+ def get_supported_mm_limits(self) -> Mapping[str, int | None]:
192
+ return {"image": None, "video": None, "audio": None}
193
+
194
+ def get_mm_max_tokens_per_item(
195
+ self,
196
+ seq_len: int,
197
+ mm_counts: Mapping[str, int] | None = None,
198
+ ) -> Mapping[str, int]:
199
+ result = {}
200
+ mm_counts = mm_counts or {}
201
+
202
+ hf_config = self.get_hf_config()
203
+ vis_cfg = hf_config.vision_config
204
+ if isinstance(vis_cfg, dict):
205
+ spatial_merge_size = vis_cfg.get("spatial_merge_size", 2)
206
+ else:
207
+ spatial_merge_size = getattr(vis_cfg, "spatial_merge_size", 2)
208
+
209
+ # Always return per-item max for all modalities — vLLM calls this
210
+ # during profiling with empty mm_counts; a missing key is treated as 0
211
+ # which causes "At most 0 video(s) may be provided" errors.
212
+ result["image"] = 256 // (spatial_merge_size ** 2)
213
+
214
+ # 32-frame videos at typical resolution produce ~7040 tokens
215
+ # (measured: 16 frames → 3520 tokens with spatial_merge_size=1).
216
+ # Cap at 64 frames worth to handle evaluation edge cases.
217
+ result["video"] = 64 * 256 // max(spatial_merge_size ** 2, 1)
218
+
219
+ feature_extractor = self.get_feature_extractor()
220
+ chunk_length = min(feature_extractor.chunk_length, 30)
221
+ audio_len = int(chunk_length * feature_extractor.sampling_rate)
222
+ hop_length = feature_extractor.hop_length
223
+ max_mel_seq_len = audio_len // hop_length
224
+ input_lengths = torch.tensor([max_mel_seq_len], dtype=torch.long)
225
+ _, output_lengths = _get_feat_extract_output_lengths(input_lengths)
226
+ result["audio"] = int(output_lengths.item())
227
+
228
+ return result
229
+
230
+
231
+ # --------------------------------------------------------------------------- #
232
+ # Dummy inputs builder
233
+ # --------------------------------------------------------------------------- #
234
+
235
+
236
+ class NanoMMDummyInputsBuilder(BaseDummyInputsBuilder[NanoMMProcessingInfo]):
237
+ def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
238
+ text = ""
239
+ num_images = mm_counts.get("image", 0)
240
+ num_videos = mm_counts.get("video", 0)
241
+ num_audios = mm_counts.get("audio", 0)
242
+
243
+ image_token = "<image>"
244
+ video_token = "<image>"
245
+ audio_token = "<|audio_bos|><|AUDIO|><|audio_eos|>"
246
+
247
+ text += image_token * num_images
248
+ text += video_token * num_videos
249
+ text += audio_token * num_audios
250
+ return text
251
+
252
+ def get_dummy_mm_data(
253
+ self,
254
+ seq_len: int,
255
+ mm_counts: Mapping[str, int],
256
+ mm_options: Mapping[str, BaseDummyOptions],
257
+ ) -> MultiModalDataDict:
258
+ result: dict[str, Any] = {}
259
+
260
+ num_images = mm_counts.get("image", 0)
261
+ if num_images > 0:
262
+ result["image"] = self._get_dummy_images(
263
+ width=224, height=224, num_images=num_images,
264
+ overrides=mm_options.get("image"),
265
+ )
266
+
267
+ num_videos = mm_counts.get("video", 0)
268
+ if num_videos > 0:
269
+ result["video"] = self._get_dummy_videos(
270
+ width=224, height=224, num_frames=2, num_videos=num_videos,
271
+ overrides=mm_options.get("video"),
272
+ )
273
+
274
+ num_audios = mm_counts.get("audio", 0)
275
+ if num_audios > 0:
276
+ feature_extractor = self.info.get_feature_extractor()
277
+ sampling_rate = feature_extractor.sampling_rate
278
+ audio_len = feature_extractor.chunk_length * sampling_rate
279
+ result["audio"] = self._get_dummy_audios(
280
+ length=audio_len, num_audios=num_audios,
281
+ overrides=mm_options.get("audio"),
282
+ )
283
+
284
+ return result
285
+
286
+
287
+ # --------------------------------------------------------------------------- #
288
+ # Multimodal processor
289
+ # --------------------------------------------------------------------------- #
290
+
291
+
292
+ class NanoMMMultiModalProcessor(BaseMultiModalProcessor[NanoMMProcessingInfo]):
293
+ def _call_hf_processor(
294
+ self,
295
+ prompt: str,
296
+ mm_data: Mapping[str, object],
297
+ mm_kwargs: Mapping[str, Any],
298
+ tok_kwargs: Mapping[str, object],
299
+ ) -> BatchFeature:
300
+ if not isinstance(mm_data, dict):
301
+ mm_data = dict(mm_data)
302
+ audios = mm_data.pop("audios", [])
303
+ if audios:
304
+ mm_data["audio"] = audios
305
+
306
+ has_audio = bool(mm_data.get("audio", []))
307
+ has_images = bool(mm_data.get("images", []))
308
+ has_videos = bool(mm_data.get("videos", []))
309
+
310
+ if not has_audio and not has_images and not has_videos:
311
+ prompt_ids = self.info.get_tokenizer().encode(prompt)
312
+ prompt_ids = self._apply_hf_processor_tokens_only(prompt_ids)
313
+ return BatchFeature(dict(input_ids=[prompt_ids]), tensor_type="pt")
314
+
315
+ if has_audio and not has_images and not has_videos:
316
+ feature_extractor = self.info.get_feature_extractor(**mm_kwargs)
317
+ tokenizer = self.info.get_tokenizer()
318
+
319
+ audio_items = mm_data.get("audio", [])
320
+ if not isinstance(audio_items, list):
321
+ audio_items = [audio_items]
322
+
323
+ def _to_audio_array(item: object) -> np.ndarray:
324
+ if hasattr(item, "data"):
325
+ item = item.data
326
+ if isinstance(item, tuple) and len(item) >= 1:
327
+ item = item[0]
328
+ if isinstance(item, dict):
329
+ for key in ("array", "audio", "data", "samples"):
330
+ if key in item:
331
+ item = item[key]
332
+ break
333
+ if hasattr(item, "array"):
334
+ item = item.array
335
+ if hasattr(item, "audio"):
336
+ item = item.audio
337
+ arr = np.asarray(item, dtype=np.float32)
338
+ if arr.ndim > 1:
339
+ arr = arr.squeeze()
340
+ return arr
341
+
342
+ processed_audio = []
343
+ for item in audio_items:
344
+ processed_audio.append(_to_audio_array(item))
345
+
346
+ audio_features = feature_extractor(
347
+ processed_audio,
348
+ sampling_rate=feature_extractor.sampling_rate,
349
+ return_tensors="pt",
350
+ padding="max_length",
351
+ )
352
+ max_mel_len = audio_features["input_features"].shape[-1]
353
+
354
+ # Keep audio prompts aligned with torch reference path
355
+ # (audio BOS + repeated audio token + audio EOS, no added special token).
356
+ prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
357
+ prompt_ids = self._apply_hf_processor_tokens_only(prompt_ids)
358
+
359
+ feature_attention_mask = torch.zeros(
360
+ (audio_features["input_features"].shape[0], max_mel_len),
361
+ dtype=torch.long,
362
+ )
363
+ feature_attention_mask[:] = 1
364
+ output = {
365
+ "input_ids": [prompt_ids],
366
+ "input_features": audio_features["input_features"],
367
+ "feature_attention_mask": feature_attention_mask,
368
+ }
369
+ return BatchFeature(output, tensor_type="pt")
370
+
371
+ if has_audio:
372
+ feature_extractor = self.info.get_feature_extractor(**mm_kwargs)
373
+ mm_kwargs = dict(**mm_kwargs, sampling_rate=feature_extractor.sampling_rate)
374
+
375
+ if has_videos:
376
+ mm_kwargs = dict(mm_kwargs, do_sample_frames=False)
377
+
378
+ return super()._call_hf_processor(
379
+ prompt=prompt,
380
+ mm_data=mm_data,
381
+ mm_kwargs=mm_kwargs,
382
+ tok_kwargs=tok_kwargs,
383
+ )
384
+
385
+ def _get_mm_fields_config(
386
+ self,
387
+ hf_inputs: BatchFeature,
388
+ hf_processor_mm_kwargs: Mapping[str, object],
389
+ ) -> Mapping[str, MultiModalFieldConfig]:
390
+ hf_cfg = self.info.get_hf_config()
391
+ spatial_merge_size = getattr(hf_cfg.vision_config, "spatial_merge_size", 2)
392
+ fields = dict(_create_qwen2vl_field_factory(spatial_merge_size)(hf_inputs))
393
+ if "input_features" in hf_inputs:
394
+ fields["input_features"] = MultiModalFieldConfig.batched("audio")
395
+ if "feature_attention_mask" in hf_inputs:
396
+ fields["feature_attention_mask"] = MultiModalFieldConfig.batched(
397
+ "audio", keep_on_cpu=True
398
+ )
399
+ if "audio_embeds" in hf_inputs:
400
+ fields["audio_embeds"] = MultiModalFieldConfig.batched("audio")
401
+ return fields
402
+
403
+ def _get_prompt_updates(
404
+ self,
405
+ mm_items: MultiModalDataItems,
406
+ hf_processor_mm_kwargs: Mapping[str, object],
407
+ out_mm_kwargs: MultiModalKwargsItems,
408
+ ) -> Sequence[PromptUpdate]:
409
+ updates = []
410
+ hf_config = self.info.get_hf_config()
411
+ out_mm_data = out_mm_kwargs.get_data()
412
+
413
+ image_token_index = getattr(hf_config, "image_token_index", None)
414
+ audio_token_id = getattr(hf_config, "audio_token_id", None)
415
+ has_image_items = any(
416
+ key in out_mm_data for key in ("pixel_values", "image_embeds", "image_grid_thw")
417
+ )
418
+ has_video_items = any(
419
+ key in out_mm_data for key in ("pixel_values_videos", "video_embeds", "video_grid_thw")
420
+ )
421
+ has_audio_items = any(
422
+ key in out_mm_data
423
+ for key in ("audio_embeds", "input_features", "feature_attention_mask")
424
+ )
425
+
426
+ spatial_merge_size = getattr(
427
+ hf_config.vision_config, "spatial_merge_size", 2
428
+ )
429
+
430
+ def _vision_replacement(grid_thw, item_idx: int):
431
+ if grid_thw is not None:
432
+ thw = grid_thw[item_idx]
433
+ t, h, w = thw.tolist() if hasattr(thw, "tolist") else (int(thw[0]), int(thw[1]), int(thw[2]))
434
+ n = int(t) * (int(h) // spatial_merge_size) * (int(w) // spatial_merge_size)
435
+ else:
436
+ n = 1
437
+ return [image_token_index] * n
438
+
439
+ if image_token_index is not None and has_image_items:
440
+ image_grid_thw = out_mm_data.get("image_grid_thw")
441
+ updates.append(
442
+ PromptReplacement(
443
+ modality="image",
444
+ target=[image_token_index],
445
+ replacement=lambda idx: _vision_replacement(image_grid_thw, idx),
446
+ )
447
+ )
448
+
449
+ if image_token_index is not None and has_video_items:
450
+ # processing_llava_eurobert.py maps both image and video tokens to
451
+ # "<image>"; the prompt uses <image> for video too.
452
+ video_grid_thw = out_mm_data.get("video_grid_thw")
453
+ updates.append(
454
+ PromptReplacement(
455
+ modality="video",
456
+ target=[image_token_index],
457
+ replacement=lambda idx: _vision_replacement(video_grid_thw, idx),
458
+ )
459
+ )
460
+
461
+ if audio_token_id is not None and has_audio_items:
462
+ feature_attention_mask = out_mm_data.get("feature_attention_mask")
463
+ if feature_attention_mask is not None:
464
+ assert isinstance(feature_attention_mask, torch.Tensor)
465
+ _, audio_output_lens = _get_feat_extract_output_lengths(
466
+ feature_attention_mask.sum(-1)
467
+ )
468
+ audio_output_lengths = audio_output_lens.tolist()
469
+ else:
470
+ audio_output_lengths = []
471
+
472
+ def get_audio_replacement(item_idx: int):
473
+ if audio_output_lengths:
474
+ n = audio_output_lengths[item_idx]
475
+ elif "audio_embeds" in out_mm_data:
476
+ embeds = out_mm_data["audio_embeds"][item_idx]
477
+ n = embeds.shape[0]
478
+ elif "input_features" in out_mm_data:
479
+ raw_feats = out_mm_data["input_features"]
480
+ if isinstance(raw_feats, torch.Tensor):
481
+ feats = raw_feats[item_idx]
482
+ else:
483
+ feat_item = raw_feats[item_idx]
484
+ feats = feat_item.data if hasattr(feat_item, "data") else feat_item
485
+ feature_len = int(feats.shape[-1])
486
+ _, output_lengths = _get_feat_extract_output_lengths(
487
+ torch.tensor([feature_len], dtype=torch.long)
488
+ )
489
+ n = int(output_lengths.item())
490
+ else:
491
+ n = 1
492
+ return [audio_token_id] * n
493
+
494
+ updates.append(
495
+ PromptReplacement(
496
+ modality="audio",
497
+ target=[audio_token_id],
498
+ replacement=get_audio_replacement,
499
+ )
500
+ )
501
+
502
+ return updates
503
+
504
+
505
+ # --------------------------------------------------------------------------- #
506
+ # Model
507
+ # --------------------------------------------------------------------------- #
508
+
509
+
510
+ @MULTIMODAL_REGISTRY.register_processor(
511
+ NanoMMMultiModalProcessor,
512
+ info=NanoMMProcessingInfo,
513
+ dummy_inputs=NanoMMDummyInputsBuilder,
514
+ )
515
+ class LlavaEuroBertAudioForVLLMEmbedding(nn.Module, SupportsMultiModal, SupportsPP):
516
+ """vLLM model for LlavaEuroBertAudioForEmbedding (nano multimodal embedding)."""
517
+
518
+ @classmethod
519
+ def get_placeholder_str(cls, modality: str, i: int) -> str | None:
520
+ if modality == "image":
521
+ return "<image>"
522
+ if modality == "video":
523
+ return "<image>"
524
+ if modality.startswith("audio"):
525
+ return f"Audio {i}: <|audio_bos|><|AUDIO|><|audio_eos|>"
526
+ return None
527
+
528
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
529
+ super().__init__()
530
+ config = vllm_config.model_config.hf_config
531
+ self.config = config
532
+ self.audio_token_id = getattr(config, "audio_token_id", None)
533
+
534
+ vis_cfg = config.vision_config
535
+ if not isinstance(vis_cfg, Qwen3VLVisionConfig):
536
+ if hasattr(vis_cfg, "to_dict"):
537
+ d = vis_cfg.to_dict()
538
+ else:
539
+ d = dict(vis_cfg)
540
+ d.pop("model_type", None)
541
+ d.pop("transformers_version", None)
542
+ vis_cfg = Qwen3VLVisionConfig(**d)
543
+ vis_cfg.deepstack_visual_indexes = []
544
+
545
+ txt_cfg = config.text_config
546
+ if isinstance(txt_cfg, dict):
547
+ from transformers import LlamaConfig
548
+ txt_cfg = LlamaConfig(**txt_cfg)
549
+ # transformers 5.x aliases rope_scaling and rope_parameters via the
550
+ # same proxy. Assigning rope_scaling = None silently nulls
551
+ # rope_parameters too, making vLLM's get_rope fall back to
552
+ # base=10000 (default Llama) instead of the model's rope_theta.
553
+ # Set only rope_theta + rope_parameters; never touch rope_scaling.
554
+ rope_params = getattr(txt_cfg, "rope_parameters", None)
555
+ if rope_params:
556
+ rope_theta = float(rope_params.get("rope_theta", 10000.0))
557
+ clean = dict(rope_params)
558
+ clean["rope_theta"] = rope_theta
559
+ txt_cfg.rope_theta = rope_theta
560
+ txt_cfg.rope_parameters = clean
561
+ text_hidden = txt_cfg.hidden_size
562
+
563
+ aud_cfg = config.audio_config
564
+ if isinstance(aud_cfg, dict):
565
+ aud_cfg = Qwen2_5OmniAudioEncoderConfig(**aud_cfg)
566
+
567
+ spatial_merge_size = getattr(vis_cfg, "spatial_merge_size", 2)
568
+ self._spatial_merge_size = spatial_merge_size
569
+
570
+ with self._mark_tower_model(vllm_config, {"image", "video"}):
571
+ self.vision_tower = Qwen3VLVisionModel(vis_cfg)
572
+ self.vision_tower.merger = nn.Identity()
573
+ self.vision_tower.deepstack_merger_list = nn.ModuleList()
574
+ self.vision_tower.deepstack_visual_indexes = []
575
+
576
+ self.merger = PretrainedMerger(
577
+ vis_cfg.hidden_size, text_hidden, spatial_merge_size
578
+ )
579
+
580
+ with self._mark_tower_model(vllm_config, "audio"):
581
+ self.audio_tower = Qwen2_5OmniAudioEncoder(aud_cfg)
582
+ self.audio_tower.proj = nn.Identity() # fused into audio_projector
583
+ d_model = getattr(aud_cfg, "d_model", 1280)
584
+ self.audio_projector = nn.Linear(d_model, text_hidden)
585
+
586
+ self.multi_modal_projector = nn.Identity()
587
+ self.lm_head = nn.Identity()
588
+
589
+ with self._mark_language_model(vllm_config):
590
+ self.language_model = init_vllm_registered_model(
591
+ vllm_config=vllm_config,
592
+ hf_config=txt_cfg,
593
+ prefix=maybe_prefix(prefix, "language_model"),
594
+ architectures=["LlamaBidirectionalModel"],
595
+ )
596
+
597
+ self.make_empty_intermediate_tensors = (
598
+ self.language_model.make_empty_intermediate_tensors
599
+ )
600
+ self._init_audio_alignment(text_hidden, vllm_config)
601
+ # Default audio prompt path expands to:
602
+ # <audio_bos> + 750 audio tokens + <audio_eos> = 752 tokens.
603
+ self._audio_default_seq_len = 752
604
+ # Set in embed_multimodal when the batch contains audio; consumed and
605
+ # cleared in forward so the seq_len fallback never fires for text-only.
606
+ self._pending_audio_in_batch = False
607
+
608
+ def _init_audio_alignment(self, hidden_size: int, vllm_config: VllmConfig) -> None:
609
+ if self.audio_token_id is None:
610
+ return
611
+ if os.getenv("JINA_OMNI_DISABLE_AUDIO_ALIGNMENT") == "1":
612
+ return
613
+
614
+ candidate_paths: list[Path] = [Path(__file__).with_name("audio_linear_alignment.pt")]
615
+ model_path = getattr(vllm_config.model_config, "model", None)
616
+ if isinstance(model_path, str):
617
+ candidate_paths.append(Path(model_path) / "audio_linear_alignment.pt")
618
+
619
+ alignment_path = next((p for p in candidate_paths if p.exists()), None)
620
+
621
+ if alignment_path is None and isinstance(model_path, str) and "/" in model_path:
622
+ try:
623
+ from huggingface_hub import hf_hub_download
624
+ alignment_path = Path(hf_hub_download(
625
+ model_path, "audio_linear_alignment.pt",
626
+ ))
627
+ except Exception:
628
+ pass
629
+
630
+ if alignment_path is None:
631
+ return
632
+
633
+ payload = torch.load(alignment_path, map_location="cpu")
634
+ matrix = payload.get("W") if isinstance(payload, dict) else payload
635
+ if not isinstance(matrix, torch.Tensor):
636
+ return
637
+ if matrix.ndim != 2:
638
+ return
639
+ if matrix.shape[0] != hidden_size or matrix.shape[1] != hidden_size:
640
+ return
641
+
642
+ self.register_buffer(
643
+ "audio_linear_alignment", matrix.to(torch.float32), persistent=False
644
+ )
645
+
646
+ def _apply_audio_alignment(
647
+ self,
648
+ hidden_states: torch.Tensor,
649
+ input_ids: torch.Tensor | None,
650
+ positions: torch.Tensor | None,
651
+ has_audio: bool = False,
652
+ ) -> torch.Tensor:
653
+ alignment_matrix = getattr(self, "audio_linear_alignment", None)
654
+ if alignment_matrix is None:
655
+ return hidden_states
656
+ if positions is None:
657
+ return hidden_states
658
+
659
+ flat_positions = positions.reshape(-1)
660
+ if flat_positions.shape[0] != hidden_states.shape[0]:
661
+ return hidden_states
662
+ flat_input_ids = input_ids.reshape(-1) if input_ids is not None else None
663
+
664
+ seq_starts = torch.nonzero(flat_positions.eq(0), as_tuple=False).flatten()
665
+ if seq_starts.numel() == 0:
666
+ seq_starts = flat_positions.new_tensor([0])
667
+ elif seq_starts[0].item() != 0:
668
+ seq_starts = torch.cat([flat_positions.new_tensor([0]), seq_starts], dim=0)
669
+
670
+ seq_ends = torch.cat(
671
+ [seq_starts[1:], flat_positions.new_tensor([flat_positions.numel()])], dim=0
672
+ )
673
+
674
+ alignment_matrix = alignment_matrix.to(
675
+ device=hidden_states.device, dtype=torch.float32
676
+ )
677
+ aligned_hidden_states = hidden_states.float()
678
+ for start, end in zip(seq_starts.tolist(), seq_ends.tolist()):
679
+ seq_len = end - start
680
+ apply_alignment = False
681
+
682
+ if flat_input_ids is not None and self.audio_token_id is not None:
683
+ apply_alignment = bool(torch.any(flat_input_ids[start:end].eq(self.audio_token_id)))
684
+ elif has_audio:
685
+ # vLLM pooling runner passes only inputs_embeds (input_ids=None).
686
+ # Only trust the default-length marker when embed_multimodal
687
+ # actually processed audio for this batch — otherwise a text
688
+ # prompt that happens to pack to 752 tokens would be poisoned.
689
+ apply_alignment = seq_len == self._audio_default_seq_len
690
+
691
+ if apply_alignment:
692
+ aligned_hidden_states[start:end] = aligned_hidden_states[start:end] @ alignment_matrix
693
+ return aligned_hidden_states.to(hidden_states.dtype)
694
+
695
+ # ---- vision processing ---- #
696
+
697
+ def _process_image_input(self, pixel_values, image_grid_thw):
698
+ vision_output = self.vision_tower(
699
+ hidden_states=pixel_values, grid_thw=image_grid_thw
700
+ )
701
+ raw_hidden = vision_output[0]
702
+
703
+ image_features = self.merger(raw_hidden)
704
+
705
+ merge = self._spatial_merge_size
706
+ tokens_per_image = []
707
+ if isinstance(image_grid_thw, list):
708
+ for t, h, w in image_grid_thw:
709
+ n = int(t) * (int(h) // merge) * (int(w) // merge)
710
+ tokens_per_image.append(n)
711
+ else:
712
+ for i in range(image_grid_thw.shape[0]):
713
+ t, h, w = image_grid_thw[i].tolist()
714
+ n = int(t) * (int(h) // merge) * (int(w) // merge)
715
+ tokens_per_image.append(n)
716
+
717
+ per_image_features = []
718
+ offset = 0
719
+ for n in tokens_per_image:
720
+ feat = image_features[offset : offset + n]
721
+ per_image_features.append(feat)
722
+ offset += n
723
+
724
+ return per_image_features
725
+
726
+ # ---- audio processing ---- #
727
+
728
+ def _parse_and_validate_audio_input(
729
+ self, **kwargs: object
730
+ ) -> NanoAudioInputs | None:
731
+ input_features = kwargs.pop("input_features", None)
732
+ audio_embeds = kwargs.pop("audio_embeds", None)
733
+ feature_attention_mask = kwargs.pop("feature_attention_mask", None)
734
+
735
+ if input_features is None and audio_embeds is None:
736
+ return None
737
+
738
+ if audio_embeds is not None:
739
+ return NanoAudioEmbeddingInputs(
740
+ type="audio_embeds", audio_embeds=audio_embeds
741
+ )
742
+
743
+ return NanoAudioFeatureInputs(
744
+ type="audio_features",
745
+ input_features=input_features,
746
+ feature_attention_mask=feature_attention_mask,
747
+ )
748
+
749
+ def _process_audio_input(
750
+ self, audio_input: NanoAudioInputs
751
+ ) -> torch.Tensor | tuple[torch.Tensor, ...]:
752
+ if audio_input["type"] == "audio_embeds":
753
+ return tuple(audio_input["audio_embeds"])
754
+
755
+ input_features = audio_input["input_features"]
756
+ feature_attention_mask = audio_input["feature_attention_mask"]
757
+
758
+ feature_lens = feature_attention_mask.sum(-1).long()
759
+ aftercnn_lens, output_lengths = (
760
+ self.audio_tower._get_feat_extract_output_lengths(feature_lens)
761
+ )
762
+
763
+ packed = input_features.permute(0, 2, 1)[feature_attention_mask.bool()].permute(1, 0)
764
+
765
+ audio_outputs = self.audio_tower(
766
+ packed, feature_lens=feature_lens, aftercnn_lens=aftercnn_lens,
767
+ )
768
+ audio_features = self.audio_projector(audio_outputs.last_hidden_state)
769
+
770
+ return torch.split(audio_features, output_lengths.tolist())
771
+
772
+ # ---- embed_multimodal ---- #
773
+
774
+ def embed_multimodal(self, **kwargs: object) -> MultiModalEmbeddings:
775
+ embeddings: list[torch.Tensor] = []
776
+
777
+ pixel_values = kwargs.pop("pixel_values", None)
778
+ image_grid_thw = kwargs.pop("image_grid_thw", None)
779
+ if pixel_values is not None and image_grid_thw is not None:
780
+ embeddings.extend(
781
+ self._process_image_input(pixel_values, image_grid_thw)
782
+ )
783
+
784
+ pixel_values_videos = kwargs.pop("pixel_values_videos", None)
785
+ video_grid_thw = kwargs.pop("video_grid_thw", None)
786
+ kwargs.pop("timestamps", None)
787
+ if pixel_values_videos is not None and video_grid_thw is not None:
788
+ embeddings.extend(
789
+ self._process_image_input(pixel_values_videos, video_grid_thw)
790
+ )
791
+
792
+ audio_input = self._parse_and_validate_audio_input(**kwargs)
793
+ if audio_input is not None:
794
+ self._pending_audio_in_batch = True
795
+ audio_embeds = self._process_audio_input(audio_input)
796
+ if isinstance(audio_embeds, tuple):
797
+ embeddings.extend(audio_embeds)
798
+ else:
799
+ embeddings.append(audio_embeds)
800
+
801
+ return embeddings if embeddings else []
802
+
803
+ # ---- forward ---- #
804
+
805
+ def forward(
806
+ self,
807
+ input_ids: torch.Tensor | None,
808
+ positions: torch.Tensor,
809
+ intermediate_tensors: IntermediateTensors | None = None,
810
+ inputs_embeds: torch.Tensor | None = None,
811
+ **kwargs: object,
812
+ ) -> torch.Tensor | IntermediateTensors:
813
+ if intermediate_tensors is not None:
814
+ inputs_embeds = None
815
+
816
+ has_audio = self._pending_audio_in_batch
817
+ self._pending_audio_in_batch = False
818
+
819
+ hidden_states = self.language_model.model(
820
+ input_ids, positions, intermediate_tensors, inputs_embeds=inputs_embeds
821
+ )
822
+ hidden_states = self._apply_audio_alignment(
823
+ hidden_states, input_ids, positions, has_audio=has_audio
824
+ )
825
+ return hidden_states
826
+
827
+ def compute_logits(self, hidden_states: torch.Tensor) -> torch.Tensor | None:
828
+ return self.language_model.compute_logits(hidden_states)
829
+
830
+ # ---- weight loading ---- #
831
+
832
+ @staticmethod
833
+ def _remap_weights(
834
+ weights: Iterable[tuple[str, torch.Tensor]],
835
+ ) -> Iterable[tuple[str, torch.Tensor]]:
836
+ for name, tensor in weights:
837
+ if name.startswith("language_model.") and not name.startswith(
838
+ "language_model.model."
839
+ ) and not name.startswith("language_model.lm_head."):
840
+ name = "language_model.model." + name[len("language_model."):]
841
+ yield name, tensor
842
+
843
+ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
844
+ loader = AutoWeightsLoader(self)
845
+ return loader.load_weights(self._remap_weights(weights))
846
+
847
+
848
+ _IMAGE_TOKEN = "<image>"
849
+ _IMAGE_PLACEHOLDER = "<image>"
850
+ _VIDEO_PLACEHOLDER = "<image>"
851
+ _AUDIO_PLACEHOLDER = "<|audio_start|><|audio_pad|><|audio_end|>"
852
+
853
+
854
+ def _image_chat_prompt(text: str = "") -> str:
855
+ return f"<|im_start|>user\n<|vision_start|>{_IMAGE_TOKEN}<|vision_end|>{text}<|im_end|>\n"
856
+
857
+
858
+ def format_prompt(text: str = "", image=None, video=None, audio=None) -> dict:
859
+ """Build a `llm.embed(...)` request dict for jina-embeddings-v5-omni-nano.
860
+
861
+ Inserts the model's vision/audio placeholder tokens for you so callers
862
+ don't need to spell them out.
863
+
864
+ For audio, also pass ``tokenization_kwargs={"add_special_tokens": False}``
865
+ to ``llm.embed`` so that LAST-token pooling lands on `<|audio_end|>` rather
866
+ than the tokenizer's auto-appended `<|end_of_text|>`.
867
+ """
868
+ if image is not None and video is None and audio is None:
869
+ return {"prompt": _image_chat_prompt(text), "multi_modal_data": {"image": image}}
870
+
871
+ parts: list[str] = []
872
+ mm: dict = {}
873
+ if image is not None:
874
+ parts.append(_IMAGE_PLACEHOLDER)
875
+ mm["image"] = image
876
+ if video is not None:
877
+ parts.append(_VIDEO_PLACEHOLDER)
878
+ mm["video"] = video
879
+ if audio is not None:
880
+ parts.append(_AUDIO_PLACEHOLDER)
881
+ mm["audio"] = audio
882
+ req: dict = {"prompt": "".join(parts) + text}
883
+ if mm:
884
+ req["multi_modal_data"] = mm
885
+ return req
886
+
887
+
888
+ import sys as _sys
889
+ _sys.modules.setdefault("jina_v5_omni", _sys.modules[__name__])
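Putting the pieces together, a hedged end-to-end sketch using `format_prompt` from this file. The model id, image file, and audio array are placeholders, and the `LLM(...)` flags carry the same version caveats as above:

```python
# Hedged multimodal embedding sketch using format_prompt (defined above).
import numpy as np
from PIL import Image
from vllm import LLM

llm = LLM(
    model="jinaai/jina-embeddings-v5-omni-nano",
    runner="pooling",                    # older vLLM versions: task="embed"
    trust_remote_code=True,
    hf_overrides={"task": "retrieval"},
)

text_req = format_prompt(text="a recording of rainfall")
image_req = format_prompt(image=Image.open("rain.jpg"))                   # hypothetical file
audio_req = format_prompt(audio=(np.zeros(16_000, dtype=np.float32), 16_000))

# Text and image requests use the default tokenization settings.
text_and_image = llm.embed([text_req, image_req])

# Per the format_prompt docstring, audio requests disable auto-added special
# tokens so LAST-token pooling ends on the audio end marker.
audio_out = llm.embed(audio_req, tokenization_kwargs={"add_special_tokens": False})
print(len(text_and_image[0].outputs.embedding), len(audio_out[0].outputs.embedding))
```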