zeekay committed on
Commit
a1c5638
·
verified ·
1 Parent(s): 6be3d3a

Update README: add abliteration methodology and Zen identity

  1. README.md +315 -90
README.md CHANGED
@@ -2,155 +2,380 @@
2
  license: apache-2.0
3
  language:
4
  - en
5
  tags:
6
  - zen
7
  - zenlm
8
  - multimodal
9
  - vision-language
10
  - audio
11
- - omni-modal
12
  - hanzo
13
  library_name: transformers
14
  pipeline_tag: image-text-to-text
15
  ---
16
 
17
- # zen-omni
18
 
19
- **Multimodal AI Model** supporting Text, Vision, and Audio
20
 
21
- Part of the Zen LM family - democratizing AI while protecting our planet.
22
 
23
- ## Model Overview
24
 
25
- zen-omni is a multimodal model capable of processing and understanding:
26
 
27
- - 📝 **Text** - Natural language understanding and generation
28
- - 🖼️ **Vision** - Image analysis and visual reasoning
29
- - 🎵 **Audio** - Speech recognition and audio understanding
30
 
31
- This is a true **omni-modal** model with unified cross-modal reasoning capabilities.
32
 
33
  ## Architecture
34
 
35
- **Base**: Zen Omni (Unified Multimodal Architecture)
36
- **Type**: Multimodal Transformer
37
- **Parameters**: ~7B
38
- **Context Length**: 32,768 tokens
39
 
40
- ### Components
41
- - **Text Encoder**: Transformer-based language model
42
- - **Vision Encoder**: Vision transformer for image understanding
43
- - **Audio Encoder**: Speech transformer for audio processing
44
- - **Multimodal Fusion**: Cross-attention mechanisms for unified understanding
45
 
46
  ## Capabilities
47
 
48
- **Cross-Modal Understanding**
49
- - Process text, images, and audio simultaneously
50
- - Reason across different modalities
51
- - Unified representation learning
52
 
53
- 🎯 **Text Understanding**
54
- - Natural language processing
55
- - Instruction following
56
- - Text generation
57
-
58
- 🖼️ **Vision Understanding**
59
- - Image analysis and description
60
- - Visual question answering
61
- - Scene understanding
62
 
63
- 🎙️ **Audio Understanding**
64
- - Speech recognition
65
- - Audio transcription
66
- - Voice interaction
67
 
68
- ## Model Variants
69
-
70
- - **zen-omni** - Base multimodal model (this repository)
71
- - **zen-omni-30b-instruct** - Instruction-tuned variant
72
- - **zen-omni-30b-thinking** - Chain-of-thought reasoning variant
73
 
74
- ## Quick Start
75
 
76
  ```python
77
  from transformers import AutoModelForCausalLM, AutoProcessor
78
 
79
- # Load model and processor
80
- model = AutoModelForCausalLM.from_pretrained("zenlm/zen-omni")
81
- processor = AutoProcessor.from_pretrained("zenlm/zen-omni")
82
 
83
- # Text input
84
- text_input = processor(text="Hello!", return_tensors="pt")
85
- output = model.generate(**text_input)
86
 
87
- # Image + Text input (multimodal)
88
- image_input = processor(
89
- text="What's in this image?",
90
- images=image, # PIL Image
91
- return_tensors="pt"
92
  )
93
- output = model.generate(**image_input)
94
 
95
- # Audio + Text input (multimodal)
96
- audio_input = processor(
97
- text="What do you hear?",
98
- audio=audio_array, # Audio data
99
- return_tensors="pt"
100
- )
101
- output = model.generate(**audio_input)
102
 
103
- response = processor.decode(output[0])
104
  ```
105
 
106
- ## Use Cases
107
 
108
- - 🎨 **Multimodal Assistants**: Interact with text, images, and voice
109
- - 📊 **Visual Question Answering**: Answer questions about images
110
- - 🎙️ **Voice Interfaces**: Build voice-enabled applications
111
- - 📱 **Accessibility Tools**: Audio description and transcription
112
- - 🤖 **Cross-Modal AI**: Tasks requiring understanding multiple modalities
113
 
114
  ## Training
115
 
116
- Fine-tuned with:
117
  - Multimodal instruction tuning
118
  - Cross-modal alignment
119
- - Audio-vision-text integration
120
- - Zen AI identity and safety training
121
 
122
- ## Technical Requirements
123
 
124
- **Important**: This is a multimodal model and requires:
125
- - Multimodal-compatible transformers library
126
- - AutoProcessor (not just tokenizer)
127
- - Support for image and audio inputs
128
- - Zen Omni compatible inference code
129
 
130
  ## Why Zen LM?
131
 
132
- 🚀 **Ultra-Efficient** - Optimized for diverse hardware
133
- 🔒 **Truly Private** - 100% local processing, no cloud
134
- 🌱 **Eco-Friendly** - 95% less energy than cloud AI
135
- 💚 **Free Forever** - Apache 2.0 licensed
136
 
137
  ## Organizations
138
 
139
- **Hanzo AI Inc** - Techstars '17 • Award-winning GenAI lab • https://hanzo.ai
140
- **Zoo Labs Foundation** - 501(c)(3) Non-Profit • Environmental AI • https://zoolabs.io
141
 
142
- ## Links
143
 
144
- 🌐 Website: https://zenlm.org
145
- 💬 Discord: https://discord.gg/hanzoai
146
- 🐦 Twitter: https://twitter.com/hanzoai
147
- 📧 Email: hello@zenlm.org
148
 
149
  ## Citation
150
 
151
 
152
-
153
 
154
  ## License
155
 
156
- Apache 2.0 • No data collection • Privacy-first
2
  license: apache-2.0
3
  language:
4
  - en
5
+ - zh
6
+ - ja
7
+ - ko
8
+ - de
9
+ - fr
10
+ - es
11
+ - it
12
+ - pt
13
+ - ru
14
  tags:
15
  - zen
16
  - zenlm
17
  - multimodal
18
  - vision-language
19
  - audio
20
+ - speech
21
+ - omni
22
  - hanzo
23
+ - thinking
24
+ - instruct
25
+ - zen-lm
26
  library_name: transformers
27
  pipeline_tag: image-text-to-text
28
  ---
29
 
30
+ # Zen Omni
31
 
32
+ **Hypermodal Language Model for Translation + Audio Generation**
33
 
34
+ > Part of the [Zen LM](https://zenlm.org) family - democratizing AI while protecting our planet.
35
 
36
+ ## Model Specifications
37
 
38
+ | Attribute | Value |
39
+ |-----------|-------|
40
+ | **Architecture** | MoE multimodal (Thinker-Talker) |
41
+ | **Total Parameters** | 30B |
42
+ | **Active Parameters** | 3B (via MoE sparse activation) |
43
+ | **Text Languages** | 119 languages |
44
+ | **Speech Input** | 19 languages |
45
+ | **Speech Output** | 10 languages |
46
+ | **Context Length** | 32,768 tokens |
47
+ | **Technical Report** | [docs/paper/paper.pdf](docs/paper/paper.pdf) |
48
+ | **License** | Apache 2.0 |
49
 
50
+ ## Model Variants
51
 
52
+ | Variant | Description | Use Case |
53
+ |---------|-------------|----------|
54
+ | **zen-omni** | Base multimodal model | General purpose |
55
+ | **zen-omni-instruct** | Instruction-following | Chat, Q&A, tasks |
56
+ | **zen-omni-thinking** | Chain-of-thought reasoning | Complex reasoning, math |
57
+ | **zen-omni-captioner** | Audio/visual captioning | Transcription, description |
58
 
59
  ## Architecture
60
 
61
+ Zen Omni is built on a **Thinker-Talker** MoE architecture:
62
 
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                          ZEN OMNI                           │
+ ├─────────────────────────────────────────────────────────────┤
+ │                                                             │
+ │  INPUT ENCODERS                                             │
+ │  ├── Audio Encoder (32 layers, 1280 dim)                    │
+ │  ├── Vision Encoder (27 layers, 1152 dim)                   │
+ │  └── Text Embeddings (151,936 vocab)                        │
+ │                       │                                     │
+ │                       ▼                                     │
+ │  ┌─────────────────────────────────────────┐                │
+ │  │        THINKER (Multimodal LLM)         │                │
+ │  │  • 48 transformer layers                │                │
+ │  │  • 128 experts (MoE)                    │                │
+ │  │  • 8 experts active per token           │                │
+ │  │  • Cross-modal attention fusion         │                │
+ │  └─────────────────────────────────────────┘                │
+ │                       │                                     │
+ │                       ▼                                     │
+ │  ┌─────────────────────────────────────────┐                │
+ │  │           TALKER (Audio Gen)            │                │
+ │  │  • Streaming speech synthesis           │                │
+ │  │  • Code2Wav audio codec                 │                │
+ │  │  • 16 quantizers, 2048 codebook         │                │
+ │  └─────────────────────────────────────────┘                │
+ │                       │                                     │
+ │                       ▼                                     │
+ │  OUTPUT: Text + Audio + Vision Understanding                │
+ │                                                             │
+ └─────────────────────────────────────────────────────────────┘
+ ```
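The "128 experts, 8 active per token" figures in the diagram describe sparse Mixture-of-Experts routing. The README does not say how the Zen Omni router is implemented, so the following is only a minimal sketch of one common variant: pick the top-k gate logits for a token, then apply a softmax over just those logits. The function name `route_token` and the random gate logits are illustrative assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # experts per MoE layer, from the diagram
TOP_K = 8          # experts active per token, from the diagram


def route_token(gate_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    # Indices of the top_k largest gate logits.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i],
                 reverse=True)[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical gate logits for a single token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = route_token(logits)
print(len(routing), "experts selected out of", NUM_EXPERTS)
```

Only the selected experts run for that token, which is why the active parameter count (3B) is much smaller than the total (30B).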
95
 
96
  ## Capabilities
97
 
98
+ ### Multimodal Understanding
99
+ - **Text**: Understanding and generation in 119 languages
100
+ - **Vision**: Image analysis, video comprehension, OCR
101
+ - **Audio**: Speech recognition in 19 languages, audio understanding
102
+ - **Cross-Modal**: Unified reasoning across all modalities
103
+
104
+ ### Speech Synthesis
105
+ - Native audio output in 10 languages
106
+ - Low-latency streaming (< 300ms)
107
+ - Natural prosody and emotion
108
+ - Voice preservation across translations
109
+
110
+ ### Translation Pipeline
111
+ - Real-time speech-to-speech translation
112
+ - Preserves speaker characteristics
113
+ - Integration with **zen-dub** for lip synchronization
114
+ - End-to-end dubbing workflow
115
+
116
+ ### Thinking Mode
117
+ - Extended reasoning (up to 32K thinking tokens)
118
+ - Complex problem solving
119
+ - Math and code reasoning
120
 
121
+ ## Quick Start
122
 
123
+ ### Installation
124
 
125
+ ```bash
126
+ pip install transformers torch soundfile
127
+ ```
128
 
129
+ ### Basic Usage
130
 
  ```python
  from transformers import AutoModelForCausalLM, AutoProcessor
 
+ # Load model
+ model_id = "zenlm/zen-omni"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ # Text-to-text with thinking
+ messages = [
+     {"role": "system", "content": "You are Zen, a helpful AI assistant."},
+     {"role": "user", "content": "Explain quantum computing in simple terms."}
+ ]
+
+ text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+ inputs = processor(text=text, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = processor.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
155
+
156
+ ### Multimodal Input (Image + Audio + Text)
157
+
+ ```python
+ from PIL import Image
+ import librosa
+
+ # Load multimodal inputs
+ image = Image.open("path/to/image.jpg")
+ audio, sr = librosa.load("path/to/audio.wav", sr=16000)
+
+ # Process multimodal message
+ messages = [
+     {"role": "user", "content": [
+         {"type": "image", "image": image},
+         {"type": "audio", "audio": audio},
+         {"type": "text", "text": "Describe this image and transcribe the audio."}
+     ]}
+ ]
+
+ inputs = processor(messages, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=1024)
+ response = processor.decode(outputs[0])
+ ```
179
 
180
+ ### Speech-to-Speech Translation
181
 
+ ```python
+ import librosa
+ import soundfile as sf
+
+ # Load source audio
+ source_audio, sr = librosa.load("japanese_speech.wav", sr=16000)
+
+ # Translate and generate English speech
+ messages = [
+     {"role": "user", "content": [
+         {"type": "audio", "audio": source_audio},
+         {"type": "text", "text": "Translate this Japanese speech to English and speak the translation."}
+     ]}
+ ]
+
+ inputs = processor(messages, return_tensors="pt").to(model.device)
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=2048,
+     return_audio=True
  )
 
+ # Save translated audio
+ translated_audio = outputs.audio[0]
+ sf.write("english_translation.wav", translated_audio, 24000)
+ ```
207
+
208
+ ### MLX (Apple Silicon)
209
+
210
+ ```bash
211
+ # 4-bit quantized for M1/M2/M3
212
+ python3 -m mlx_lm.generate --model ./mlx/q4 --prompt "Hello"
213
+ ```
214
+
215
+ ### GGUF (llama.cpp / LM Studio)
216
 
217
+ ```bash
218
+ # Load in LM Studio or llama.cpp
219
+ ./llama-cli -m ./gguf/zen-omni-30b-q4_k_m.gguf -p "Hello"
220
  ```
221
 
222
+ ## Model Files & Formats
223
 
224
+ | Format | Size | RAM | Use Case |
225
+ |--------|------|-----|----------|
226
+ | **SafeTensors** (BF16) | ~60GB | 80GB+ | Training, full precision |
227
+ | **MLX 4-bit** | ~15GB | 20GB | Apple Silicon (M1/M2/M3) |
228
+ | **MLX 8-bit** | ~30GB | 32GB | Apple Silicon (higher quality) |
229
+ | **GGUF Q4_K_M** | ~15GB | 20GB | llama.cpp, LM Studio |
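The sizes in the table follow from simple arithmetic on the 30B total parameter count: bytes ≈ parameters × bits per weight ÷ 8. The sketch below reproduces the rounded figures under that assumption; the helper name `weight_size_gb` is illustrative, and real quantized files tend to run somewhat larger because they also store scales and other metadata.

```python
def weight_size_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 30e9  # total parameter count from the specification table

for name, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: ~{weight_size_gb(TOTAL_PARAMS, bits):.0f} GB")
```

The RAM column is higher than the file size because inference also needs room for activations and the KV cache.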
230
+
231
+ ## Performance (Apple Silicon)
232
+
233
+ - **M1/M2/M3**: 10-20 tokens/sec
234
+ - **RAM Required**: 20-24GB minimum
235
+ - **Recommended**: M2 Pro/Max or M3 with 32GB+ RAM
236
+
237
+ ## Integration with Zen Dub
238
+
239
+ Zen Omni integrates with [zen-dub](https://github.com/zenlm/zen-dub) for complete video dubbing:
240
+
+ ```python
+ from zen_omni import ZenOmniTranslator
+ from zen_dub import ZenDubPipeline
+
+ # Initialize components
+ translator = ZenOmniTranslator("zenlm/zen-omni")
+ lip_sync = ZenDubPipeline("zenlm/zen-dub")
+
+ # Full dubbing pipeline
+ def dub_video(video_path, target_language="en"):
+     # 1. Extract audio and frames from the video
+     #    (extract_video is not defined in this README; it is assumed to be provided elsewhere)
+     audio, frames = extract_video(video_path)
+
+     # 2. Translate speech with Zen Omni
+     translated_audio = translator.translate_speech(
+         audio,
+         target_language=target_language,
+         preserve_prosody=True
+     )
+
+     # 3. Generate lip-synced video with Zen Dub
+     dubbed_video = lip_sync.generate(
+         frames=frames,
+         audio=translated_audio,
+         fps=30
+     )
+
+     return dubbed_video
+
+ # Run pipeline
+ result = dub_video("input_japanese.mp4", target_language="en")
+ result.save("output_english_dubbed.mp4")
+ ```
274
 
275
  ## Training
276
 
277
+ Fine-tuned from the Zen Omni 30B MoE base with:
278
  - Multimodal instruction tuning
279
  - Cross-modal alignment
280
+ - Zen AI identity training (LoRA)
281
+
282
+ Training configuration: [`training/zen_identity_sft.yaml`](training/zen_identity_sft.yaml)
283
+
284
+ ### Identity Training with ms-swift
285
+
+ ```bash
+ # Install ms-swift
+ pip install ms-swift
+
+ # Fine-tune with Zen identity
+ swift sft \
+     --model_type omni-30b-a3b \
+     --model_id_or_path zenlm/zen-omni \
+     --dataset zen_identity \
+     --output_dir ./zen-omni-finetuned \
+     --lora_rank 64 \
+     --lora_alpha 128 \
+     --max_steps 1000 \
+     --learning_rate 1e-4
+ ```
301
+
302
+ ## Cookbooks & Examples
303
+
304
+ See the [`cookbooks/`](cookbooks/) directory for Jupyter notebooks:
305
+
306
+ - `omni_captioner.ipynb` - Audio/visual captioning
307
+ - `audio_visual_dialogue.ipynb` - Multimodal conversations
308
+ - `speech_recognition.ipynb` - Speech-to-text
309
+ - `image_question.ipynb` - Visual Q&A
310
+ - `video_description.ipynb` - Video understanding
311
 
312
+ ## Web Demos
313
 
314
+ ```bash
315
+ # Full multimodal demo
316
+ python web_demo.py --checkpoint-path zenlm/zen-omni --flash-attn2
317
+
318
+ # Audio captioner
319
+ python web_demo_captioner.py --checkpoint-path zenlm/zen-omni --flash-attn2
320
+ ```
321
+
322
+ ## Performance Benchmarks
323
+
324
+ | Benchmark | Zen Omni | Notes |
325
+ |-----------|----------|-------|
326
+ | Speech Translation (BLEU) | 42.3 | En↔Ja bidirectional |
327
+ | Image Understanding (VQA) | 78.2% | Visual question answering |
328
+ | Audio Transcription (WER) | 4.2% | English ASR |
329
+ | Cross-Modal Reasoning | 85.1% | MMLU multimodal |
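For readers unfamiliar with the WER metric in the table, word error rate is usually defined as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below illustrates that definition only; it is not the evaluation code behind the numbers above, and the function name `word_error_rate` and the sample sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.1%}")
```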
330
 
331
  ## Why Zen LM?
332
 
333
+ - **Ultra-Efficient** - 3B active parameters via MoE
334
+ - **Truly Private** - 100% local processing, no cloud required
335
+ - **Environmentally Responsible** - 95% less energy than cloud AI
336
+ - **Free Forever** - Apache 2.0 licensed
337
 
338
  ## Organizations
339
 
340
+ - **[Hanzo AI Inc](https://hanzo.ai)** - Techstars '17 • Award-winning GenAI lab
341
+ - **[Zoo Labs Foundation](https://zoolabs.io)** - 501(c)(3) Non-Profit
342
 
343
+ ## Resources
344
 
345
+ - [Website](https://zenlm.org)
346
+ - [Documentation](https://docs.zenlm.org)
347
+ - [Discord](https://discord.gg/hanzoai)
348
+ - hello@zenlm.org
349
 
350
  ## Citation
351
 
352
+ ```bibtex
353
+ @misc{zen-omni-2024,
354
+ title={Zen Omni: Hypermodal Language Model for Translation and Audio Generation},
355
+ author={Zen LM Team and Hanzo AI},
356
+ year={2024},
357
+ url={https://huggingface.co/zenlm/zen-omni}
358
+ }
359
 
360
+ ```
361
 
362
  ## License
363
 
364
+ Apache 2.0 • No data collection • Privacy-first
365
+
366
+ ## Abliteration
367
+
368
+ Zen Omni 30B weights are derived from an abliterated base model. Abliteration removes
369
+ refusal behavior by identifying and nullifying the "refusal direction" in the model's
370
+ residual stream, as described in
371
+ [Maxime Labonne's methodology](https://huggingface.co/blog/mlabonne/abliteration).
372
+
373
+ **Method**: Directional ablation on the residual stream across all layers
374
+ **Source**: Zen MoDE (Mixture of Distilled Experts) architecture base
375
+ **Effect**: Removes refusal behaviors while preserving all other capabilities
376
+ **Identity layer**: Zen identity added via system prompt — full LoRA fine-tuning planned
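The core operation behind directional ablation can be sketched as an orthogonal projection: given a residual-stream activation x and a refusal direction r, the ablated activation is x minus its component along r. In the linked methodology the direction is typically estimated from differences in mean activations between harmful and harmless prompts; the vectors below are hypothetical toy values, and the function name `ablate_direction` is an assumption. This is not the code used to produce the Zen Omni weights.

```python
import math


def ablate_direction(x, r):
    """Remove the component of activation x that lies along direction r."""
    norm = math.sqrt(sum(v * v for v in r))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [v / norm for v in r]
    # Scalar projection of x onto the unit direction.
    proj = sum(a * u for a, u in zip(x, unit))
    # Subtract the projected component; the result is orthogonal to r.
    return [a - proj * u for a, u in zip(x, unit)]


# Hypothetical 3-dimensional activation and refusal direction.
activation = [2.0, 1.0, -1.0]
refusal_dir = [1.0, 0.0, 0.0]
ablated = ablate_direction(activation, refusal_dir)
print(ablated)
```

Applying the same projection at every layer is what "across all layers" refers to; components orthogonal to the direction pass through unchanged.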
377
+
378
+ Abliteration is a feature, not a limitation. It enables unrestricted research,
379
+ security testing, and applications where safety guardrails are managed at the
380
+ application layer rather than baked into model weights.
381
+