Qwen3-TTS GGUF

GGUF weights for qwentts.cpp, a C++17/GGML port of Qwen3-TTS 12 Hz (Qwen team, Alibaba). Multilingual zero-shot TTS with named speakers and Mandarin dialects, 24 kHz mono output. Runs on CPU, CUDA, Metal, and Vulkan.

Files

Two GGUFs load together:

qwen-talker-{size}-{mode}-{variant}.gguf   Qwen3 LM + code predictor MTP head + optional speaker encoder, text -> 12 Hz codes
qwen-tokenizer-12hz-{variant}.gguf         SEANet + ConvNeXt + DAC v2 + RVQ, 12 Hz codes <-> 24 kHz audio

Three modes are available across two talker sizes:

mode         0.6B  1.7B  use case
base         yes   yes   zero-shot TTS with named speakers and dialects
customvoice  yes   yes   zero-shot voice cloning from a reference clip
voicedesign  no    yes   voice synthesis from attribute description

The tokenizer is shared across every talker.
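
As a rough illustration of the naming scheme and the mode table above, the helpers below build the two file names for a given combination. The function names are hypothetical, and the lowercase size tokens (0.6b, 1.7b) are an assumption based on the 1.7b file name used in the quick start.

```shell
# talker_file <size> <mode> <variant>
# Hypothetical helper: prints the talker file name for one combination.
talker_file() {
    size="$1"; mode="$2"; variant="$3"
    # The mode table lists voicedesign for the 1.7B talker only.
    if [ "$mode" = "voicedesign" ] && [ "$size" != "1.7b" ]; then
        echo "voicedesign is only listed for the 1.7b talker" >&2
        return 1
    fi
    echo "qwen-talker-${size}-${mode}-${variant}.gguf"
}

# tokenizer_file <variant>
# The tokenizer is shared, so only the variant is part of the name.
tokenizer_file() {
    echo "qwen-tokenizer-12hz-$1.gguf"
}
```

For example, `talker_file 1.7b base Q8_0` and `tokenizer_file Q8_0` give the two names passed to huggingface-cli in the quick start.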

variant  talker 0.6B  talker 1.7B  tokenizer  use case
F32      3.7 GB       7.7 GB       647 MB     reference, debug, conversion
BF16     1.8 GB       3.9 GB       359 MB     source-faithful, max precision
Q8_0     993 MB       2.1 GB       291 MB     recommended default
Q4_K_M   629 MB       1.2 GB       255 MB     lowest VRAM
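
One way to read the size table is as a budget check. The sketch below, which is not part of qwentts.cpp, adds the 1.7B talker and tokenizer file sizes from the table (taking 1 GB as 1000 MB) and prints the highest-precision variant whose files fit a given budget in MB. The function name is hypothetical, and file size is only a lower bound: runtime memory also needs room for caches and activations.

```shell
# pick_variant_1_7b <budget_mb>
# Hypothetical helper: prints the first variant (highest precision first)
# whose talker 1.7B + tokenizer files fit in the budget.
pick_variant_1_7b() {
    budget="$1"
    # Combined sizes in MB from the table: talker 1.7B + tokenizer.
    for entry in F32:8347 BF16:4259 Q8_0:2391 Q4_K_M:1455; do
        name="${entry%%:*}"
        size="${entry##*:}"
        if [ "$size" -le "$budget" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no variant fits in ${budget} MB" >&2
    return 1
}
```

With a 3000 MB budget this prints Q8_0, which matches the recommended default.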

Quick start

git clone --recurse-submodules https://github.com/ServeurpersoCom/qwentts.cpp.git
cd qwentts.cpp && ./buildcuda.sh
mkdir -p models
huggingface-cli download Serveurperso/Qwen3-TTS-GGUF \
    qwen-talker-1.7b-base-Q8_0.gguf qwen-tokenizer-12hz-Q8_0.gguf \
    --local-dir models
cd examples
./base.sh         # named speaker      -> base.wav
./clone.sh        # voice cloning      -> clone.wav
./customvoice.sh  # custom voice mode  -> customvoice.wav
./tts.sh          # voice design       -> tts.wav

Backends

Set GGML_BACKEND to force a device; otherwise the runtime picks the best one available.

value    target
CUDA0    NVIDIA GPU, fastest path on Ada / Blackwell
Vulkan0  cross-vendor GPU (AMD / Intel / NVIDIA)
Metal    Apple Silicon GPU
CPU      CPU fallback, x86 variant auto-selected
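
A minimal sketch of forcing a backend for a single command, assuming the example scripts inherit GGML_BACKEND from the environment. The wrapper and its function names are hypothetical. It only accepts the four values listed in the table, so other device indices (a second GPU, for instance) are deliberately out of scope.

```shell
# is_listed_backend <value>
# Succeeds only for the values listed in the backend table.
is_listed_backend() {
    case "$1" in
        CUDA0|Vulkan0|Metal|CPU) return 0 ;;
        *) return 1 ;;
    esac
}

# run_with_backend <value> <command> [args...]
# Hypothetical wrapper: sets GGML_BACKEND for one command only.
run_with_backend() {
    backend="$1"
    shift
    if ! is_listed_backend "$backend"; then
        echo "backend not in the table: $backend" >&2
        return 1
    fi
    GGML_BACKEND="$backend" "$@"
}
```

From the examples directory, `run_with_backend CPU ./base.sh` would run the named-speaker example on the CPU path; the plain form `GGML_BACKEND=CPU ./base.sh` does the same without the check.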

Quantization policy

Tokenizer GGUFs are not uniformly quantized. Three tensor categories get dedicated treatment:

tensor                                                       dtype across all variants
RVQ codebooks, input_proj / output_proj, speaker encoder fc  F32
1D tensors (gamma, biases, norms, snake alpha and beta)      F32
Conv kernels with non-alignable rows (K=7,3,1)               F16 in Q* variants

Conv kernel rows (K=7,3,1) are never a multiple of a K-quant block size, so the quantizer skips the Q* intermediates and lands on F16 directly. This is the last-resort branch of llama.cpp's tensor_type_fallback, applied unconditionally for these kernels. F16 imposes no block-size constraint and matches the runtime target dtype on every backend.

The talker LM (Qwen3 backbone, hidden size divisible by 256) follows standard llama.cpp K-quant rules across variants. The code predictor MTP head and the speaker encoder live in the talker GGUF and share its quantization.
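
The row-alignment rule can be sketched as a small check. This is an illustration of the idea, not the converter's actual code: the function name is hypothetical, the variant name stands in for the per-tensor quant type, and the block sizes used are the ggml values (32 elements for Q8_0, 256 for K-quants). The F32 rules for codebooks and 1D tensors are left out.

```shell
# pick_dtype <row_len> <requested_variant>
# Hypothetical helper: prints the dtype a tensor row would end up with.
pick_dtype() {
    row_len="$1"; requested="$2"
    case "$requested" in
        Q4_K_M) block=256 ;;  # K-quant super-block size
        Q8_0)   block=32 ;;   # Q8_0 block size
        *)      echo "$requested"; return 0 ;;  # F32 / BF16: no block constraint
    esac
    if [ $((row_len % block)) -eq 0 ]; then
        echo "$requested"
    else
        # Row is not a multiple of the block size: go straight to F16.
        echo "F16"
    fi
}
```

A conv kernel row of length 7 gives F16 for both Q8_0 and Q4_K_M, while a row of length 2048 (an illustrative value that is divisible by 256) keeps the requested quantization.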

License

Upstream model: Qwen3-TTS by Alibaba / Qwen team, Apache 2.0
Audio codec: Qwen3-TTS-Tokenizer-12Hz (Qwen team), Apache 2.0
GGUF tooling: qwentts.cpp, MIT
