---
license: apache-2.0
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- it
- pt
- ru
tags:
- zen
- zenlm
- multimodal
- vision-language
- audio
- speech
- omni
- hanzo
- thinking
- instruct
- zen-lm
library_name: transformers
pipeline_tag: image-text-to-text
---

# Zen Omni

**Hypermodal Language Model for Translation + Audio Generation**

> Part of the [Zen LM](https://zenlm.org) family - democratizing AI while protecting our planet.

## Model Specifications

| Attribute | Value |
|-----------|-------|
| **Architecture** | MoE multimodal (Thinker-Talker) |
| **Total Parameters** | 30B |
| **Active Parameters** | 3B (via MoE sparse activation) |
| **Text Languages** | 119 languages |
| **Speech Input** | 19 languages |
| **Speech Output** | 10 languages |
| **Context Length** | 32,768 tokens |
| **Technical Report** | [docs/paper/paper.pdf](docs/paper/paper.pdf) |
| **License** | Apache 2.0 |

## Model Variants

| Variant | Description | Use Case |
|---------|-------------|----------|
| **zen-omni** | Base multimodal model | General purpose |
| **zen-omni-instruct** | Instruction-following | Chat, Q&A, tasks |
| **zen-omni-thinking** | Chain-of-thought reasoning | Complex reasoning, math |
| **zen-omni-captioner** | Audio/visual captioning | Transcription, description |

## Architecture

Zen Omni is built on a **Thinker-Talker** MoE architecture:

```
┌─────────────────────────────────────────────────────────────┐
│                      ZEN OMNI                                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  INPUT ENCODERS                                              │
│  ├── Audio Encoder (32 layers, 1280 dim)                    │
│  ├── Vision Encoder (27 layers, 1152 dim)                   │
│  └── Text Embeddings (151,936 vocab)                        │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │         THINKER (Multimodal LLM)        │                │
│  │  • 48 transformer layers                 │                │
│  │  • 128 experts (MoE)                     │                │
│  │  • 8 experts active per token            │                │
│  │  • Cross-modal attention fusion          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  ┌─────────────────────────────────────────┐                │
│  │            TALKER (Audio Gen)           │                │
│  │  • Streaming speech synthesis            │                │
│  │  • Code2Wav audio codec                  │                │
│  │  • 16 quantizers, 2048 codebook          │                │
│  └─────────────────────────────────────────┘                │
│           │                                                  │
│           ▼                                                  │
│  OUTPUT: Text + Audio + Vision Understanding                │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
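
To make the "128 experts, 8 experts active per token" lines concrete, here is a minimal NumPy sketch of top-k expert routing. It illustrates the general sparse-MoE technique rather than Zen Omni's actual router: the hidden size, the random weights, and the names `route_top_k` and `router_weights` are assumptions chosen for illustration.

```python
import numpy as np

NUM_EXPERTS = 128  # from the diagram above
TOP_K = 8          # experts active per token, from the diagram above


def route_top_k(hidden, router_weights, k=TOP_K):
    """Pick the k highest-scoring experts per token and normalise their gates.

    hidden:         (tokens, d_model) activations
    router_weights: (d_model, num_experts) router projection (hypothetical)
    Returns (expert_ids, gates), each of shape (tokens, k).
    """
    logits = hidden @ router_weights                      # (tokens, num_experts)
    # Indices of the k largest logits per token (unordered within the top k).
    expert_ids = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, expert_ids, axis=-1)
    # Softmax over the selected experts only, so each token's gates sum to 1.
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=-1, keepdims=True)
    return expert_ids, gates


rng = np.random.default_rng(0)
d_model = 64                                  # toy size, not the real hidden dim
hidden = rng.standard_normal((4, d_model))    # 4 toy tokens
router_weights = rng.standard_normal((d_model, NUM_EXPERTS))

expert_ids, gates = route_top_k(hidden, router_weights)
print(expert_ids.shape, gates.shape)          # (4, 8) (4, 8)
print(TOP_K / NUM_EXPERTS)                    # 0.0625
```

Only 8 of 128 experts (6.25%) run for each token, which is how a 30B-parameter model can activate roughly 3B parameters per step.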

## Capabilities

### Multimodal Understanding
- **Text**: Understanding and generation in 119 languages
- **Vision**: Image analysis, video comprehension, OCR
- **Audio**: Speech recognition in 19 languages, audio understanding
- **Cross-Modal**: Unified reasoning across all modalities

### Speech Synthesis
- Native audio output in 10 languages
- Low-latency streaming (< 300 ms)
- Natural prosody and emotion
- Voice preservation across translations

### Translation Pipeline
- Real-time speech-to-speech translation
- Preserves speaker characteristics
- Integration with **zen-dub** for lip synchronization
- End-to-end dubbing workflow

### Thinking Mode
- Extended reasoning (up to 32K thinking tokens)
- Complex problem solving
- Math and code reasoning

## Quick Start

### Installation

```bash
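# accelerate is needed for device_map="auto"; librosa and pillow for the multimodal examples
pip install transformers torch accelerate soundfile librosa pillow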
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model_id = "zenlm/zen-omni"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Text-only chat
messages = [
    {"role": "system", "content": "You are Zen, a helpful AI assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt").to(model.device)
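outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
response = processor.decode(new_tokens, skip_special_tokens=True)
print(response)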
```

### Multimodal Input (Image + Audio + Text)

```python
from PIL import Image
import librosa

# Reuses `model` and `processor` from the Basic Usage example

# Load multimodal inputs
image = Image.open("path/to/image.jpg")
audio, sr = librosa.load("path/to/audio.wav", sr=16000)

# Process multimodal message
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "audio", "audio": audio},
        {"type": "text", "text": "Describe this image and transcribe the audio."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Speech-to-Speech Translation

```python
import librosa
import soundfile as sf

# Reuses `model` and `processor` from the Basic Usage example

# Load source audio
source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)

# Translate and generate English speech
messages = [
    {"role": "user", "content": [
        {"type": "audio", "audio": source_audio},
        {"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
    ]}
]

inputs = processor(messages, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    return_audio=True
)

# Save translated audio
translated_audio = outputs.audio[0]
sf.write("english_translation.wav", translated_audio, 24000)
```
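
If `soundfile` is unavailable, the save step can be approximated with the standard library. This sketch assumes the generated waveform is a mono sequence of floats in [-1, 1] at 24 kHz (the rate passed to `sf.write` above). `write_pcm16_wav` is a hypothetical helper, and the sine tone merely stands in for model output.

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # matches the rate used in sf.write above


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))      # clip to avoid integer overflow
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)                    # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Stand-in for model output: 0.5 s of a 440 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(SAMPLE_RATE // 2)]
write_pcm16_wav("tone_check.wav", tone)

# Read the file back to confirm its length.
with wave.open("tone_check.wav", "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()
print(duration)  # 0.5
```

Reading the file back and checking its duration is a cheap sanity check that the sample rate used for writing matches the rate the audio was generated at.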

### MLX (Apple Silicon)

```bash
# 4-bit quantized for M1/M2/M3
python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
```

### GGUF (llama.cpp / LM Studio)

```bash
# Load in LM Studio or llama.cpp
./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
```

## Model Files & Formats

| Format | Size | RAM | Use Case |
|--------|------|-----|----------|
| **SafeTensors** (BF16) | ~60GB | 80GB+ | Training, full precision |
| **MLX 4-bit** | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
| **MLX 8-bit** | ~30GB | 32GB | Apple Silicon (higher quality) |
| **GGUF Q4_K_M** | ~15GB | 20GB | llama.cpp, LM Studio |

## Performance (Apple Silicon)

- **M1/M2/M3**: 10-20 tokens/sec
- **RAM Required**: 20-24GB minimum
- **Recommended**: M2 Pro/Max or M3 with 32GB+ RAM

## Integration with Zen Dub

Zen Omni integrates with [zen-dub](https://github.com/zenlm/zen-dub) for complete video dubbing:

```python
from zen_omni import ZenOmniTranslator
from zen_dub import ZenDubPipeline

# Initialize components
translator = ZenOmniTranslator("zenlm/zen-omni")
lip_sync = ZenDubPipeline("zenlm/zen-dub")

# Full dubbing pipeline
def dub_video(video_path, target_language="en"):
    # 1. Extract audio and frames from the video
    #    (extract_video is not defined in this snippet; supply your own helper)
    audio, frames = extract_video(video_path)

    # 2. Translate speech with Zen Omni
    translated_audio = translator.translate_speech(
        audio,
        target_language=target_language,
        preserve_prosody=True
    )

    # 3. Generate lip-synced video with Zen Dub
    dubbed_video = lip_sync.generate(
        frames=frames,
        audio=translated_audio,
        fps=30
    )

    return dubbed_video

# Run pipeline
result = dub_video("input_japanese.mp4", target_language="en")
result.save("output_english_dubbed.mp4")
```

## Training

Fine-tuned from the Zen Omni 30B MoE base with:
- Multimodal instruction tuning
- Cross-modal alignment
- Zen AI identity training (LoRA)

Training configuration: [`training/zen_identity_sft.yaml`](training/zen_identity_sft.yaml)

### Identity Training with ms-swift

```bash
# Install ms-swift
pip install ms-swift

# Fine-tune with Zen identity
swift sft \
    --model_type omni-30b-a3b \
    --model_id_or_path zenlm/zen-omni \
    --dataset zen_identity \
    --output_dir ./zen-omni-finetuned \
    --lora_rank 64 \
    --lora_alpha 128 \
    --max_steps 1000 \
    --learning_rate 1e-4
```
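
For readers unfamiliar with the `--lora_rank` and `--lora_alpha` flags, the following NumPy sketch shows the standard LoRA update they parameterise: the frozen weight `W` is augmented with a low-rank product scaled by `alpha / rank` (here 128 / 64 = 2). The layer dimensions are toy values chosen for illustration, and this is not ms-swift's implementation.

```python
import numpy as np

RANK, ALPHA = 64, 128          # values from the swift sft command above
D_IN, D_OUT = 512, 512         # toy layer size, chosen for illustration

rng = np.random.default_rng(0)
W = rng.standard_normal((D_OUT, D_IN))          # frozen base weight
A = rng.standard_normal((RANK, D_IN)) * 0.01    # trainable, small random init
B = np.zeros((D_OUT, RANK))                     # trainable, zero init

scaling = ALPHA / RANK


def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would be trained."""
    return W @ x + scaling * (B @ (A @ x))


x = rng.standard_normal(D_IN)
# With B initialised to zero, the adapter starts as a no-op.
print(np.allclose(lora_forward(x), W @ x))      # True

trainable = A.size + B.size
print(scaling, trainable, W.size)               # 2.0 65536 262144
```

In this toy layer the adapter trains 65,536 values against 262,144 frozen ones. The ratio shrinks further as layer dimensions grow, since adapter size scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.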

## Cookbooks & Examples

See the [`cookbooks/`](cookbooks/) directory for Jupyter notebooks:

- `omni_captioner.ipynb` - Audio/visual captioning
- `audio_visual_dialogue.ipynb` - Multimodal conversations
- `speech_recognition.ipynb` - Speech-to-text
- `image_question.ipynb` - Visual Q&A
- `video_description.ipynb` - Video understanding

## Web Demos

```bash
# Full multimodal demo
python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2

# Audio captioner
python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2
```

## Performance Benchmarks

| Benchmark | Zen Omni | Notes |
|-----------|----------|-------|
| Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
| Image Understanding (VQA) | 78.2% | Visual question answering |
| Audio Transcription (WER) | 4.2% | English ASR |
| Cross-Modal Reasoning | 85.1% | MMLU multimodal |

## Why Zen LM?

- **Ultra-Efficient** - 3B active parameters via MoE
- **Truly Private** - 100% local processing, no cloud required
- **Environmentally Responsible** - 95% less energy than cloud AI
- **Free Forever** - Apache 2.0 licensed

## Organizations

- **[Hanzo AI Inc](https://hanzo.ai)** - Techstars '17 • Award-winning GenAI lab
- **[Zoo Labs Foundation](https://zoolabs.io)** - 501(c)(3) Non-Profit

## Resources

- [Website](https://zenlm.org)
- [Documentation](https://docs.zenlm.org)
- [Discord](https://discord.gg/hanzoai)
- Email: [hello@zenlm.org](mailto:hello@zenlm.org)

## Citation

```bibtex
@misc{zen-omni-2024,
  title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://huggingface.co/zenlm/zen-omni}
}
```

## License

Apache 2.0 • No data collection • Privacy-first

## Abliteration

Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes
refusal behavior by identifying and nullifying the "refusal direction" in the model's
residual stream, as described in
[Maxime Labonne's methodology](https://huggingface.co/blog/mlabonne/abliteration).

- **Method**: Directional ablation on the residual stream across all layers
- **Source**: Zen MoDE (Mixture of Distilled Experts) architecture base
- **Effect**: Removes refusal behaviors while preserving all other capabilities
- **Identity layer**: Zen identity added via system prompt; full LoRA fine-tuning planned

Abliteration is a feature, not a limitation. It enables unrestricted research,
security testing, and applications where safety guardrails are managed at the
application layer rather than baked into model weights.
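
The core operation behind directional ablation is a projection: given a unit vector `r` for the refusal direction, each residual-stream activation `x` is replaced by `x - (x · r) r`. A minimal NumPy sketch follows, with random vectors standing in for real activations and for the direction. How the direction is estimated and which weights are edited for Zen Omni are not shown here, and `ablate_direction` is a hypothetical helper name.

```python
import numpy as np


def ablate_direction(activations, direction):
    """Remove the component of each activation along `direction`.

    activations: (tokens, d_model) residual-stream vectors
    direction:   (d_model,) vector; normalised internally
    """
    unit = direction / np.linalg.norm(direction)
    # Projection coefficient of every token onto the unit direction.
    coeffs = activations @ unit                      # (tokens,)
    return activations - np.outer(coeffs, unit)


rng = np.random.default_rng(0)
acts = rng.standard_normal((5, 16))      # toy activations, not model data
refusal_dir = rng.standard_normal(16)    # placeholder direction

ablated = ablate_direction(acts, refusal_dir)
# After ablation, nothing remains along the removed direction.
print(np.allclose(ablated @ refusal_dir, 0.0))       # True
```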