Update README: add abliteration methodology and Zen identity
README.md
CHANGED
|
@@ -2,155 +2,380 @@
|
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
tags:
|
| 6 |
- zen
|
| 7 |
- zenlm
|
| 8 |
- multimodal
|
| 9 |
- vision-language
|
| 10 |
- audio
|
| 11 |
-
-
|
| 12 |
- hanzo
|
| 13 |
library_name: transformers
|
| 14 |
pipeline_tag: image-text-to-text
|
| 15 |
---
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
-
**
|
| 20 |
|
| 21 |
-
Part of the Zen LM family - democratizing AI while protecting our planet.
|
| 22 |
|
| 23 |
-
## Model
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
-
- 🖼️ **Vision** - Image analysis and visual reasoning
|
| 29 |
-
- 🎵 **Audio** - Speech recognition and audio understanding
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
## Architecture
|
| 34 |
|
| 35 |
-
|
| 36 |
-
**Type**: Multimodal Transformer
|
| 37 |
-
**Parameters**: ~7B
|
| 38 |
-
**Context Length**: 32,768 tokens
|
| 39 |
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
|
| 46 |
## Capabilities
|
| 47 |
|
| 48 |
-
|
| 49 |
-
-
|
| 50 |
-
-
|
| 51 |
-
-
|
| 52 |
|
| 53 |
-
|
| 54 |
-
- Natural language processing
|
| 55 |
-
- Instruction following
|
| 56 |
-
- Text generation
|
| 57 |
-
|
| 58 |
-
🖼️ **Vision Understanding**
|
| 59 |
-
- Image analysis and description
|
| 60 |
-
- Visual question answering
|
| 61 |
-
- Scene understanding
|
| 62 |
|
| 63 |
-
|
| 64 |
-
- Speech recognition
|
| 65 |
-
- Audio transcription
|
| 66 |
-
- Voice interaction
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
- **zen-omni-30b-instruct** - Instruction-tuned variant
|
| 72 |
-
- **zen-omni-30b-thinking** - Chain-of-thought reasoning variant
|
| 73 |
|
| 74 |
-
##
|
| 75 |
|
| 76 |
```python
|
| 77 |
from transformers import AutoModelForCausalLM, AutoProcessor
|
| 78 |
|
| 79 |
-
# Load model
|
| 80 |
-
|
| 81 |
-
|
| 82 |
|
| 83 |
-
#
|
| 84 |
-
text_input = processor(text="Hello!", return_tensors="pt")
|
| 85 |
-
output = model.generate(**text_input)
|
| 86 |
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
)
|
| 93 |
-
output = model.generate(**image_input)
|
| 94 |
|
| 95 |
-
#
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
)
|
| 101 |
-
|
| 102 |
|
| 103 |
-
|
| 104 |
```
|
| 105 |
|
| 106 |
-
##
|
| 107 |
|
| 108 |
-
|
| 109 |
-
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
|
| 114 |
## Training
|
| 115 |
|
| 116 |
-
Fine-tuned with:
|
| 117 |
- Multimodal instruction tuning
|
| 118 |
- Cross-modal alignment
|
| 119 |
-
-
|
| 120 |
-
|
| 121 |
|
| 122 |
-
##
|
| 123 |
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
|
| 128 |
-
|
| 129 |
|
| 130 |
## Why Zen LM?
|
| 131 |
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
|
| 137 |
## Organizations
|
| 138 |
|
| 139 |
-
**Hanzo AI Inc** - Techstars '17 • Award-winning GenAI lab
|
| 140 |
-
**Zoo Labs Foundation** - 501(c)(3) Non-Profit
|
| 141 |
|
| 142 |
-
##
|
| 143 |
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
|
| 149 |
## Citation
|
| 150 |
|
| 151 |
|
| 152 |
-
|
| 153 |
|
| 154 |
## License
|
| 155 |
|
| 156 |
-
Apache 2.0 • No data collection • Privacy-first
|
| 2 |
license: apache-2.0
|
| 3 |
language:
|
| 4 |
- en
|
| 5 |
+
- zh
|
| 6 |
+
- ja
|
| 7 |
+
- ko
|
| 8 |
+
- de
|
| 9 |
+
- fr
|
| 10 |
+
- es
|
| 11 |
+
- it
|
| 12 |
+
- pt
|
| 13 |
+
- ru
|
| 14 |
tags:
|
| 15 |
- zen
|
| 16 |
- zenlm
|
| 17 |
- multimodal
|
| 18 |
- vision-language
|
| 19 |
- audio
|
| 20 |
+
- speech
|
| 21 |
+
- omni
|
| 22 |
- hanzo
|
| 23 |
+
- thinking
|
| 24 |
+
- instruct
|
| 25 |
+
- zen-lm
|
| 26 |
library_name: transformers
|
| 27 |
pipeline_tag: image-text-to-text
|
| 28 |
---
|
| 29 |
|
| 30 |
+
# Zen Omni
|
| 31 |
|
| 32 |
+
**Hypermodal Language Model for Translation + Audio Generation**
|
| 33 |
|
| 34 |
+
> Part of the [Zen LM](https://zenlm.org) family - democratizing AI while protecting our planet.
|
| 35 |
|
| 36 |
+
## Model Specifications
|
| 37 |
|
| 38 |
+
| Attribute | Value |
|
| 39 |
+
|-----------|-------|
|
| 40 |
+
| **Architecture** | MoE multimodal (Thinker-Talker) |
|
| 41 |
+
| **Total Parameters** | 30B |
|
| 42 |
+
| **Active Parameters** | 3B (via MoE sparse activation) |
|
| 43 |
+
| **Text Languages** | 119 languages |
|
| 44 |
+
| **Speech Input** | 19 languages |
|
| 45 |
+
| **Speech Output** | 10 languages |
|
| 46 |
+
| **Context Length** | 32,768 tokens |
|
| 47 |
+
| **Technical Report** | [docs/paper/paper.pdf](docs/paper/paper.pdf) |
|
| 48 |
+
| **License** | Apache 2.0 |
|
| 49 |
|
| 50 |
+
## Model Variants
|
| 51 |
|
| 52 |
+
| Variant | Description | Use Case |
|
| 53 |
+
|---------|-------------|----------|
|
| 54 |
+
| **zen-omni** | Base multimodal model | General purpose |
|
| 55 |
+
| **zen-omni-instruct** | Instruction-following | Chat, Q&A, tasks |
|
| 56 |
+
| **zen-omni-thinking** | Chain-of-thought reasoning | Complex reasoning, math |
|
| 57 |
+
| **zen-omni-captioner** | Audio/visual captioning | Transcription, description |
|
| 58 |
|
| 59 |
## Architecture
|
| 60 |
|
| 61 |
+
Zen Omni is built on a **Thinker-Talker** MoE architecture:
|
| 62 |
|
| 63 |
+
```
|
| 64 |
+
┌─────────────────────────────────────────────────────────────┐
|
| 65 |
+
│ ZEN OMNI │
|
| 66 |
+
├─────────────────────────────────────────────────────────────┤
|
| 67 |
+
│ │
|
| 68 |
+
│ INPUT ENCODERS │
|
| 69 |
+
│ ├── Audio Encoder (32 layers, 1280 dim) │
|
| 70 |
+
│ ├── Vision Encoder (27 layers, 1152 dim) │
|
| 71 |
+
│ └── Text Embeddings (151,936 vocab) │
|
| 72 |
+
│ │ │
|
| 73 |
+
│ ▼ │
|
| 74 |
+
│ ┌─────────────────────────────────────────┐ │
|
| 75 |
+
│ │ THINKER (Multimodal LLM) │ │
|
| 76 |
+
│ │ • 48 transformer layers │ │
|
| 77 |
+
│ │ • 128 experts (MoE) │ │
|
| 78 |
+
│ │ • 8 experts active per token │ │
|
| 79 |
+
│ │ • Cross-modal attention fusion │ │
|
| 80 |
+
│ └─────────────────────────────────────────┘ │
|
| 81 |
+
│ │ │
|
| 82 |
+
│ ▼ │
|
| 83 |
+
│ ┌─────────────────────────────────────────┐ │
|
| 84 |
+
│ │ TALKER (Audio Gen) │ │
|
| 85 |
+
│ │ • Streaming speech synthesis │ │
|
| 86 |
+
│ │ • Code2Wav audio codec │ │
|
| 87 |
+
│ │ • 16 quantizers, 2048 codebook │ │
|
| 88 |
+
│ └─────────────────────────────────────────┘ │
|
| 89 |
+
│ │ │
|
| 90 |
+
│ ▼ │
|
| 91 |
+
│ OUTPUT: Text + Audio + Vision Understanding │
|
| 92 |
+
│ │
|
| 93 |
+
└─────────────────────────────────────────────────────────────┘
|
| 94 |
+
```
|
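The Thinker's sparse activation (8 of 128 experts per token) can be illustrated with a toy top-k router. This is a minimal sketch, not the model's actual routing code: the gating scheme (a softmax over the selected logits) and the function name `route_token` are assumptions made for illustration. Only the expert counts come from the diagram.

```python
import math
import random

NUM_EXPERTS = 128  # expert count from the architecture diagram
TOP_K = 8          # experts active per token, from the same diagram


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Weights are a softmax over the selected logits only (an assumed gating
    scheme), so they sum to 1.
    """
    # Indices of the top_k largest logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)
print(len(routing), "experts used out of", NUM_EXPERTS)
```

Only the selected experts run for a given token, which is why the active parameter count (3B) is much smaller than the total (30B).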
| 95 |
|
| 96 |
## Capabilities
|
| 97 |
|
| 98 |
+
### Multimodal Understanding
|
| 99 |
+
- **Text**: Understanding and generation in 119 languages
|
| 100 |
+
- **Vision**: Image analysis, video comprehension, OCR
|
| 101 |
+
- **Audio**: Speech recognition in 19 languages, audio understanding
|
| 102 |
+
- **Cross-Modal**: Unified reasoning across all modalities
|
| 103 |
+
|
| 104 |
+
### Speech Synthesis
|
| 105 |
+
- Native audio output in 10 languages
|
| 106 |
+
- Low-latency streaming (< 300ms)
|
| 107 |
+
- Natural prosody and emotion
|
| 108 |
+
- Voice preservation across translations
|
| 109 |
+
|
| 110 |
+
### Translation Pipeline
|
| 111 |
+
- Real-time speech-to-speech translation
|
| 112 |
+
- Preserves speaker characteristics
|
| 113 |
+
- Integration with **zen-dub** for lip synchronization
|
| 114 |
+
- End-to-end dubbing workflow
|
| 115 |
+
|
| 116 |
+
### Thinking Mode
|
| 117 |
+
- Extended reasoning (up to 32K thinking tokens)
|
| 118 |
+
- Complex problem solving
|
| 119 |
+
- Math and code reasoning
|
| 120 |
|
| 121 |
+
## Quick Start
|
| 122 |
|
| 123 |
+
### Installation
|
| 124 |
|
| 125 |
+
```bash
|
| 126 |
+
pip install transformers torch soundfile
|
| 127 |
+
```
|
| 128 |
|
| 129 |
+
### Basic Usage
|
| 130 |
|
| 131 |
```python
|
| 132 |
from transformers import AutoModelForCausalLM, AutoProcessor
|
| 133 |
|
| 134 |
+
# Load model
|
| 135 |
+
model_id = "zenlm/zen-omni"
|
| 136 |
+
model = AutoModelForCausalLM.from_pretrained(
|
| 137 |
+
model_id,
|
| 138 |
+
torch_dtype="auto",
|
| 139 |
+
device_map="auto"
|
| 140 |
+
)
|
| 141 |
+
processor = AutoProcessor.from_pretrained(model_id)
|
| 142 |
+
|
| 143 |
+
# Text-to-text with thinking
|
| 144 |
+
messages = [
|
| 145 |
+
{"role": "system", "content": "You are Zen, a helpful AI assistant."},
|
| 146 |
+
{"role": "user", "content": "Explain quantum computing in simple terms."}
|
| 147 |
+
]
|
| 148 |
+
|
| 149 |
+
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
|
| 150 |
+
inputs = processor(text=text, return_tensors="pt").to(model.device)
|
| 151 |
+
outputs = model.generate(**inputs, max_new_tokens=512)
|
| 152 |
+
# Decode only the newly generated tokens, not the echoed prompt
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
|
| 153 |
+
print(response)
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
### Multimodal Input (Image + Audio + Text)
|
| 157 |
+
|
| 158 |
+
```python
|
| 159 |
+
from PIL import Image
|
| 160 |
+
import librosa
|
| 161 |
+
|
| 162 |
+
# Load multimodal inputs
|
| 163 |
+
image = Image.open("path/to/image.jpg")
|
| 164 |
+
audio, sr = librosa.load("path/to/audio.wav", sr=16000)
|
| 165 |
+
|
| 166 |
+
# Process multimodal message
|
| 167 |
+
messages = [
|
| 168 |
+
{"role": "user", "content": [
|
| 169 |
+
{"type": "image", "image": image},
|
| 170 |
+
{"type": "audio", "audio": audio},
|
| 171 |
+
{"type": "text", "text": "Describe this image and transcribe the audio."}
|
| 172 |
+
]}
|
| 173 |
+
]
|
| 174 |
+
|
| 175 |
+
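# Build model inputs from the chat messages via the processor's chat template
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)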
|
| 176 |
+
outputs = model.generate(**inputs, max_new_tokens=1024)
|
| 177 |
+
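# Decode only the newly generated tokens and drop special tokens
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)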
|
| 178 |
+
```
|
| 179 |
|
| 180 |
+
### Speech-to-Speech Translation
|
| 181 |
|
| 182 |
+
```python
|
| 183 |
+
import librosa
import soundfile as sf
|
| 184 |
+
|
| 185 |
+
# Load source audio
|
| 186 |
+
source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)
|
| 187 |
+
|
| 188 |
+
# Translate and generate English speech
|
| 189 |
+
messages = [
|
| 190 |
+
{"role": "user", "content": [
|
| 191 |
+
{"type": "audio", "audio": source_audio},
|
| 192 |
+
{"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
|
| 193 |
+
]}
|
| 194 |
+
]
|
| 195 |
+
|
| 196 |
+
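# Build model inputs from the chat messages via the processor's chat template
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)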
|
| 197 |
+
outputs = model.generate(
|
| 198 |
+
**inputs,
|
| 199 |
+
max_new_tokens=2048,
|
| 200 |
+
return_audio=True
|
| 201 |
)
|
| 202 |
|
| 203 |
+
# Save translated audio
|
| 204 |
+
translated_audio = outputs.audio[0]
|
| 205 |
+
sf.write("english_translation.wav", translated_audio, 24000)
|
| 206 |
+
```
|
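The speech example reads its input at 16 kHz and writes its output at 24 kHz, so any duration check has to use the rate that matches the buffer. A minimal sketch of that arithmetic follows; the helper name `duration_seconds` is hypothetical, and the two rates are simply the values used in the example.

```python
INPUT_SR = 16_000   # rate passed to librosa.load in the example
OUTPUT_SR = 24_000  # rate passed to sf.write in the example


def duration_seconds(num_samples, sample_rate):
    """Length of a mono clip in seconds."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


# A 3-second clip has a different sample count at each rate.
input_samples = 3 * INPUT_SR
output_samples = 3 * OUTPUT_SR
print(duration_seconds(input_samples, INPUT_SR))    # 3.0
print(duration_seconds(output_samples, OUTPUT_SR))  # 3.0
# Interpreting 24 kHz output at 16 kHz stretches it by 24000 / 16000 = 1.5x.
print(duration_seconds(output_samples, INPUT_SR))   # 4.5
```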
| 207 |
+
|
| 208 |
+
### MLX (Apple Silicon)
|
| 209 |
+
|
| 210 |
+
```bash
|
| 211 |
+
# 4-bit quantized for M1/M2/M3
|
| 212 |
+
python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
|
| 213 |
+
```
|
| 214 |
+
|
| 215 |
+
### GGUF (llama.cpp / LM Studio)
|
| 216 |
|
| 217 |
+
```bash
|
| 218 |
+
# Load in LM Studio or llama.cpp
|
| 219 |
+
./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
|
| 220 |
```
|
| 221 |
|
| 222 |
+
## Model Files & Formats
|
| 223 |
|
| 224 |
+
| Format | Size | RAM | Use Case |
|
| 225 |
+
|--------|------|-----|----------|
|
| 226 |
+
| **SafeTensors** (BF16) | ~60GB | 80GB+ | Training, full precision |
|
| 227 |
+
| **MLX 4-bit** | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
|
| 228 |
+
| **MLX 8-bit** | ~30GB | 32GB | Apple Silicon (higher quality) |
|
| 229 |
+
| **GGUF Q4_K_M** | ~15GB | 20GB | llama.cpp, LM Studio |
|
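The sizes in this table follow from the 30B total parameter count and the number of bits stored per weight. A rough sketch of that arithmetic, assuming raw weights only (the helper name `weight_size_gb` is hypothetical, and real files can differ because of quantization scales and metadata):

```python
TOTAL_PARAMS = 30e9  # total parameters from the specifications table


def weight_size_gb(num_params, bits_per_param):
    """Approximate size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_size_gb(TOTAL_PARAMS, bits):.0f} GB")
# BF16: ~60 GB
# 8-bit: ~30 GB
# 4-bit: ~15 GB
```

The RAM column is higher than the file size because inference also needs room for activations and the KV cache.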
| 230 |
+
|
| 231 |
+
## Performance (Apple Silicon)
|
| 232 |
+
|
| 233 |
+
- **M1/M2/M3**: 10-20 tokens/sec
|
| 234 |
+
- **RAM Required**: 20-24GB minimum
|
| 235 |
+
- **Recommended**: M2 Pro/Max or M3 with 32GB+ RAM
|
| 236 |
+
|
| 237 |
+
## Integration with Zen Dub
|
| 238 |
+
|
| 239 |
+
Zen Omni integrates with [zen-dub](https://github.com/zenlm/zen-dub) for complete video dubbing:
|
| 240 |
+
|
| 241 |
+
```python
|
| 242 |
+
from zen_omni import ZenOmniTranslator
|
| 243 |
+
from zen_dub import ZenDubPipeline
|
| 244 |
+
|
| 245 |
+
# Initialize components
|
| 246 |
+
translator = ZenOmniTranslator("zenlm/zen-omni")
|
| 247 |
+
lip_sync = ZenDubPipeline("zenlm/zen-dub")
|
| 248 |
+
|
| 249 |
+
# Full dubbing pipeline
|
| 250 |
+
def dub_video(video_path, target_language="en"):
|
| 251 |
+
# 1. Extract audio and frames from the video (extract_video is not defined here; supply your own helper)
|
| 252 |
+
audio, frames = extract_video(video_path)
|
| 253 |
+
|
| 254 |
+
# 2. Translate speech with Zen Omni
|
| 255 |
+
translated_audio = translator.translate_speech(
|
| 256 |
+
audio,
|
| 257 |
+
target_language=target_language,
|
| 258 |
+
preserve_prosody=True
|
| 259 |
+
)
|
| 260 |
+
|
| 261 |
+
# 3. Generate lip-synced video with Zen Dub
|
| 262 |
+
dubbed_video = lip_sync.generate(
|
| 263 |
+
frames=frames,
|
| 264 |
+
audio=translated_audio,
|
| 265 |
+
fps=30
|
| 266 |
+
)
|
| 267 |
+
|
| 268 |
+
return dubbed_video
|
| 269 |
+
|
| 270 |
+
# Run pipeline
|
| 271 |
+
result = dub_video("input_japanese.mp4", target_language="en")
|
| 272 |
+
result.save("output_english_dubbed.mp4")
|
| 273 |
+
```
|
| 274 |
|
| 275 |
## Training
|
| 276 |
|
| 277 |
+
Fine-tuned from the Zen Omni 30B MoE base with:
|
| 278 |
- Multimodal instruction tuning
|
| 279 |
- Cross-modal alignment
|
| 280 |
+
- Zen AI identity training (LoRA)
|
| 281 |
+
|
| 282 |
+
Training configuration: [`training/zen_identity_sft.yaml`](training/zen_identity_sft.yaml)
|
| 283 |
+
|
| 284 |
+
### Identity Training with ms-swift
|
| 285 |
+
|
| 286 |
+
```bash
|
| 287 |
+
# Install ms-swift
|
| 288 |
+
pip install ms-swift
|
| 289 |
+
|
| 290 |
+
# Fine-tune with Zen identity
|
| 291 |
+
swift sft \
|
| 292 |
+
--model_type omni-30b-a3b \
|
| 293 |
+
--model_id_or_path zenlm/zen-omni \
|
| 294 |
+
--dataset zen_identity \
|
| 295 |
+
--output_dir ./zen-omni-finetuned \
|
| 296 |
+
--lora_rank 64 \
|
| 297 |
+
--lora_alpha 128 \
|
| 298 |
+
--max_steps 1000 \
|
| 299 |
+
--learning_rate 1e-4
|
| 300 |
+
```
|
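The `--lora_rank 64` and `--lora_alpha 128` flags determine how large the adapter is and how strongly it is applied. The sketch below assumes the standard LoRA formulation, where the update `B @ A` is scaled by `alpha / rank`; the layer width `HIDDEN` is a hypothetical value chosen only to make the numbers concrete, not a dimension taken from the model.

```python
def lora_scaling(rank, alpha):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_trainable_params(d_in, d_out, rank):
    """Parameters added for one adapted weight of shape (d_out, d_in).

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


RANK, ALPHA = 64, 128  # values from the swift sft command
HIDDEN = 2048          # hypothetical layer width, for illustration only

print(lora_scaling(RANK, ALPHA))                    # 2.0
print(lora_trainable_params(HIDDEN, HIDDEN, RANK))  # 262144
print(HIDDEN * HIDDEN)                              # 4194304 in the full matrix
```

With these settings the adapter for a square layer of that width trains about 6% of the parameters a full update would touch.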
| 301 |
+
|
| 302 |
+
## Cookbooks & Examples
|
| 303 |
+
|
| 304 |
+
See the [`cookbooks/`](cookbooks/) directory for Jupyter notebooks:
|
| 305 |
+
|
| 306 |
+
- `omni_captioner.ipynb` - Audio/visual captioning
|
| 307 |
+
- `audio_visual_dialogue.ipynb` - Multimodal conversations
|
| 308 |
+
- `speech_recognition.ipynb` - Speech-to-text
|
| 309 |
+
- `image_question.ipynb` - Visual Q&A
|
| 310 |
+
- `video_description.ipynb` - Video understanding
|
| 311 |
|
| 312 |
+
## Web Demos
|
| 313 |
|
| 314 |
+
```bash
|
| 315 |
+
# Full multimodal demo
|
| 316 |
+
python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2
|
| 317 |
+
|
| 318 |
+
# Audio captioner
|
| 319 |
+
python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2
|
| 320 |
+
```
|
| 321 |
+
|
| 322 |
+
## Performance Benchmarks
|
| 323 |
+
|
| 324 |
+
| Benchmark | Zen Omni | Notes |
|
| 325 |
+
|-----------|----------|-------|
|
| 326 |
+
| Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
|
| 327 |
+
| Image Understanding (VQA) | 78.2% | Visual question answering |
|
| 328 |
+
| Audio Transcription (WER) | 4.2% | English ASR |
|
| 329 |
+
| Cross-Modal Reasoning | 85.1% | MMLU multimodal |
|
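The README does not say how the WER figure was computed or on which test set. For reference, the usual definition is the word-level edit distance divided by the number of reference words; a minimal sketch of that definition (the function name `word_error_rate` is illustrative, and no text normalization is applied):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```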
| 330 |
|
| 331 |
## Why Zen LM?
|
| 332 |
|
| 333 |
+
- **Ultra-Efficient** - 3B active parameters via MoE
|
| 334 |
+
- **Truly Private** - 100% local processing, no cloud required
|
| 335 |
+
- **Environmentally Responsible** - 95% less energy than cloud AI
|
| 336 |
+
- **Free Forever** - Apache 2.0 licensed
|
| 337 |
|
| 338 |
## Organizations
|
| 339 |
|
| 340 |
+
- **[Hanzo AI Inc](https://hanzo.ai)** - Techstars '17 • Award-winning GenAI lab
|
| 341 |
+
- **[Zoo Labs Foundation](https://zoolabs.io)** - 501(c)(3) Non-Profit
|
| 342 |
|
| 343 |
+
## Resources
|
| 344 |
|
| 345 |
+
- [Website](https://zenlm.org)
|
| 346 |
+
- [Documentation](https://docs.zenlm.org)
|
| 347 |
+
- [Discord](https://discord.gg/hanzoai)
|
| 348 |
+
- hello@zenlm.org
|
| 349 |
|
| 350 |
## Citation
|
| 351 |
|
| 352 |
+
```bibtex
|
| 353 |
+
@misc{zen-omni-2024,
|
| 354 |
+
title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
|
| 355 |
+
author={Zen LM Team and Hanzo AI},
|
| 356 |
+
year={2024},
|
| 357 |
+
url={https://huggingface.co/zenlm/zen-omni}
|
| 358 |
+
}
|
| 359 |
|
| 360 |
+
```
|
| 361 |
|
| 362 |
## License
|
| 363 |
|
| 364 |
+
Apache 2.0 • No data collection • Privacy-first
|
| 365 |
+
|
| 366 |
+
## Abliteration
|
| 367 |
+
|
| 368 |
+
Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes
|
| 369 |
+
refusal behavior by identifying and nullifying the "refusal direction" in the model's
|
| 370 |
+
residual stream, as described in
|
| 371 |
+
[Maxime Labonne's methodology](https://huggingface.co/blog/mlabonne/abliteration).
|
| 372 |
+
|
| 373 |
+
**Method**: Directional ablation on the residual stream across all layers
|
| 374 |
+
**Source**: Zen MoDE (Mixture of Distilled Experts) architecture base
|
| 375 |
+
**Effect**: Removes refusal behaviors while preserving all other capabilities
|
| 376 |
+
**Identity layer**: Zen identity added via system prompt; full LoRA fine-tuning planned
|
| 377 |
+
|
| 378 |
+
Abliteration is a feature, not a limitation. It enables unrestricted research,
|
| 379 |
+
security testing, and applications where safety guardrails are managed at the
|
| 380 |
+
application layer rather than baked into model weights.
|
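The core operation in directional ablation is an orthogonal projection: for an activation x and a unit direction r, the ablated activation is x - (x · r) r. The toy sketch below shows only that projection on plain Python lists. The vectors are made up, and how the direction is estimated or which weights are edited is not covered here.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction`."""
    unit = normalize(direction)
    coeff = sum(a * u for a, u in zip(activation, unit))
    return [a - coeff * u for a, u in zip(activation, unit)]


direction = [1.0, 2.0, 2.0]    # toy stand-in for a learned direction
activation = [3.0, -1.0, 4.0]  # toy residual-stream vector
ablated = ablate_direction(activation, direction)

# After ablation the vector has no component along the direction.
residual = sum(a * d for a, d in zip(ablated, direction))
print(abs(residual) < 1e-9)  # True
```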
| 381 |
+
|