Qwen3-TTS GGUF
GGUF weights for qwentts.cpp, a C++17/GGML port of Qwen3-TTS 12 Hz (Qwen team, Alibaba). Multilingual zero-shot TTS with named speakers and Mandarin dialects, 24 kHz mono output. Runs on CPU, CUDA, Metal, and Vulkan.
Files
Two GGUFs load together:
| file | contents | role |
|---|---|---|
| qwen-talker-{size}-{mode}-{variant}.gguf | Qwen3 LM + code predictor MTP head + optional speaker encoder | text -> 12 Hz codes |
| qwen-tokenizer-12hz-{variant}.gguf | SEANet + ConvNeXt + DAC v2 + RVQ | 12 Hz codes <-> 24 kHz audio |
Three modes are available across two talker sizes:
| mode | 0.6B | 1.7B | use case |
|---|---|---|---|
| base | yes | yes | zero shot TTS with named speakers and dialects |
| customvoice | yes | yes | zero shot voice cloning from a reference clip |
| voicedesign | no | yes | voice synthesis from attribute description |
The tokenizer is shared across every talker.
| variant | talker 0.6B | talker 1.7B | tokenizer | use case |
|---|---|---|---|---|
| F32 | 3.7 GB | 7.7 GB | 647 MB | reference, debug, conversion |
| BF16 | 1.8 GB | 3.9 GB | 359 MB | source faithful, max precision |
| Q8_0 | 993 MB | 2.1 GB | 291 MB | recommended default |
| Q4_K_M | 629 MB | 1.2 GB | 255 MB | lowest VRAM |
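The naming pattern and the mode table above can be sketched as two small shell helpers. The function names `talker_file` and `tokenizer_file` are hypothetical, and the lowercase spelling of sizes and modes is an assumption extrapolated from the one file name shown in this card (`qwen-talker-1.7b-base-Q8_0.gguf`); check the repository listing before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers, not part of qwentts.cpp.
# Assumption: sizes and modes are lowercase in file names, as in the
# qwen-talker-1.7b-base-Q8_0.gguf example from this card.

# talker_file SIZE MODE VARIANT -> prints the talker file name
talker_file() {
    size=$1 mode=$2 variant=$3
    case "$size" in
        0.6b|1.7b) ;;
        *) echo "unknown size: $size" >&2; return 1 ;;
    esac
    case "$mode" in
        base|customvoice) ;;
        voicedesign)
            # The mode table lists voicedesign for the 1.7B talker only.
            if [ "$size" != "1.7b" ]; then
                echo "voicedesign is only listed for 1.7b" >&2
                return 1
            fi ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    printf 'qwen-talker-%s-%s-%s.gguf\n' "$size" "$mode" "$variant"
}

# tokenizer_file VARIANT -> prints the shared tokenizer file name
tokenizer_file() {
    printf 'qwen-tokenizer-12hz-%s.gguf\n' "$1"
}

talker_file 1.7b base Q8_0
tokenizer_file Q8_0
```

Because the tokenizer is shared, only the talker name depends on size and mode; the two calls at the end print the pair used in the quick start.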
Quick start
git clone --recurse-submodules https://github.com/ServeurpersoCom/qwentts.cpp.git
cd qwentts.cpp && ./buildcuda.sh
mkdir -p models
huggingface-cli download Serveurperso/Qwen3-TTS-GGUF \
qwen-talker-1.7b-base-Q8_0.gguf qwen-tokenizer-12hz-Q8_0.gguf \
--local-dir models
cd examples
./base.sh # named speaker -> base.wav
./clone.sh # voice cloning -> clone.wav
./customvoice.sh # custom voice mode -> customvoice.wav
./tts.sh # voice design -> tts.wav
Backends
Set GGML_BACKEND to force a device, otherwise the runtime picks the
best one available.
| value | target |
|---|---|
| CUDA0 | NVIDIA GPU, fastest path on Ada / Blackwell |
| Vulkan0 | Cross vendor GPU (AMD / Intel / NVIDIA) |
| Metal | Apple Silicon GPU |
| CPU | CPU fallback, x86 variant auto selected |
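As a minimal sketch of forcing a device, the wrapper below sets GGML_BACKEND for a single command only. The function name `with_backend` is hypothetical, and the allow-list simply mirrors the four values in the table above; it is an assumption that no other values are needed on your machine.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of qwentts.cpp.
# Validates the backend name against the table above, then runs the given
# command with GGML_BACKEND set in that command's environment only.

with_backend() {
    backend=$1
    shift
    case "$backend" in
        CUDA0|Vulkan0|Metal|CPU) ;;
        *) echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    GGML_BACKEND=$backend "$@"
}

# Example (from the examples directory): with_backend CPU ./base.sh
```

Leaving the variable unset keeps the default behaviour, where the runtime picks the best available device.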
Quantization policy
Tokenizer GGUFs are not uniform quants. Three categories get dedicated treatment:
| tensor | dtype across all variants |
|---|---|
| RVQ codebooks, input_proj / output_proj, speaker encoder fc | F32 |
| 1D tensors (gamma, biases, norms, snake alpha and beta) | F32 |
| Conv kernels with non alignable rows (K=7,3,1) | F16 in Q* variants |
Conv kernel row lengths (K=7,3,1) are never a multiple of a quantization
block size (32 for Q8_0, 256 for K-quants), so the quantizer skips the
Q* intermediates and lands on F16 directly. This is the last-resort
branch of llama.cpp's tensor_type_fallback, applied unconditionally for
these kernels. F16 has no block-size constraint and matches the runtime
target dtype on every backend. The talker LM (Qwen3 backbone, hidden
size divisible by 256) follows standard llama.cpp K-quant rules across
variants. The code predictor MTP head and the speaker encoder live in
the talker GGUF and share its quantization.
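The alignment rule can be illustrated with a few lines of shell arithmetic. This is a sketch of the rule as described above, not the converter's actual code: the function names are hypothetical, the Q4_K_M variant is treated as a single 256-wide block type for simplicity, and the row length 1024 is a made-up example of a size divisible by 256.

```shell
#!/bin/sh
# Sketch only, not the qwentts.cpp or llama.cpp implementation.
# Block sizes are the GGML constants: 256 for K-quants, 32 for Q8_0.

block_size() {
    case "$1" in
        Q4_K_M) echo 256 ;;
        Q8_0)   echo 32 ;;
        *)      echo 1 ;;   # F32 / BF16 / F16: no block constraint
    esac
}

# pick_dtype ROW_LEN VARIANT -> dtype a tensor with that row length ends up in
pick_dtype() {
    row_len=$1 variant=$2
    bs=$(block_size "$variant")
    if [ $((row_len % bs)) -eq 0 ]; then
        echo "$variant"     # row splits into whole blocks, quantize normally
    else
        echo F16            # row cannot be split into blocks, fall back
    fi
}

for k in 7 3 1; do
    echo "K=$k Q8_0 -> $(pick_dtype "$k" Q8_0)"
done
echo "row=1024 Q4_K_M -> $(pick_dtype 1024 Q4_K_M)"
```

All three kernel widths fall back to F16, while the hypothetical 1024-wide row keeps its quantized type.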
License
| component | source | license |
|---|---|---|
| Upstream model | Qwen3-TTS by Alibaba / Qwen team | Apache 2.0 |
| Audio codec | Qwen3-TTS-Tokenizer-12Hz (Qwen team) | Apache 2.0 |
| GGUF tooling | qwentts.cpp | MIT |