Qwen3-TTS GGUF

GGUF weights for qwentts.cpp, a C++17/GGML port of Qwen3-TTS 12 Hz (Qwen team, Alibaba). Multilingual zero-shot TTS with named speakers and Mandarin dialects; output is 24 kHz mono. Runs on CPU, CUDA, Metal, and Vulkan.

Files

Two GGUFs load together:

qwen-talker-{size}-{mode}-{variant}.gguf   Qwen3 LM + code predictor MTP head + optional speaker encoder, text -> 12 Hz codes
qwen-tokenizer-12hz-{variant}.gguf         SEANet + ConvNeXt + DAC v2 + RVQ, 12 Hz codes <-> 24 kHz audio

Three modes are available across two talker sizes:

mode	0.6B	1.7B	use case
base	yes	yes	zero-shot TTS with named speakers and dialects
customvoice	yes	yes	zero-shot voice cloning from a reference clip
voicedesign	no	yes	voice synthesis from attribute description

The tokenizer is shared across every talker.

variant	talker 0.6B	talker 1.7B	tokenizer	use case
F32	3.7 GB	7.7 GB	647 MB	reference, debug, conversion
BF16	1.8 GB	3.9 GB	359 MB	source faithful, max precision
Q8_0	993 MB	2.1 GB	291 MB	recommended default
Q4_K_M	629 MB	1.2 GB	255 MB	lowest VRAM
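
The naming pattern and the tables above can be combined into a small shell helper that builds the two file names to download. This is only a sketch: the function names are hypothetical, and the lowercase size strings (`0.6b`, `1.7b`) are an assumption based on the file name used in the quick start.

```shell
# Hypothetical helpers: build GGUF file names from the naming pattern above.
# Assumption: sizes are spelled in lowercase ("0.6b", "1.7b"), as in the
# quick start file name.

# talker_file SIZE MODE VARIANT
talker_file() {
    # voicedesign is only listed for the 1.7B talker in the mode table.
    if [ "$2" = "voicedesign" ] && [ "$1" != "1.7b" ]; then
        echo "voicedesign requires the 1.7b talker" >&2
        return 1
    fi
    echo "qwen-talker-$1-$2-$3.gguf"
}

# tokenizer_file VARIANT (the tokenizer is shared across every talker)
tokenizer_file() {
    echo "qwen-tokenizer-12hz-$1.gguf"
}
```

For example, `talker_file 1.7b base Q8_0` and `tokenizer_file Q8_0` produce the two file names passed to `huggingface-cli download` in the quick start.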

Quick start

git clone --recurse-submodules https://github.com/ServeurpersoCom/qwentts.cpp.git
cd qwentts.cpp && ./buildcuda.sh
mkdir -p models
huggingface-cli download Serveurperso/Qwen3-TTS-GGUF \
    qwen-talker-1.7b-base-Q8_0.gguf qwen-tokenizer-12hz-Q8_0.gguf \
    --local-dir models
cd examples
./base.sh         # named speaker      -> base.wav
./clone.sh        # voice cloning      -> clone.wav
./customvoice.sh  # custom voice mode  -> customvoice.wav
./tts.sh          # voice design       -> tts.wav

Backends

Set GGML_BACKEND to force a device; otherwise the runtime picks the best one available.

value	target
`CUDA0`	NVIDIA GPU, fastest path on Ada / Blackwell
`Vulkan0`	Cross-vendor GPU (AMD / Intel / NVIDIA)
`Metal`	Apple Silicon GPU
`CPU`	CPU fallback, x86 variant auto-selected
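
A minimal sketch of forcing a backend for a single command. The `with_backend` wrapper is hypothetical, not part of qwentts.cpp; it accepts only the four values from the table above, so any other device name the runtime might understand is rejected here.

```shell
# Hypothetical wrapper: run a command with GGML_BACKEND set to one of the
# values from the table above. Unknown values are rejected up front.
with_backend() {
    backend="$1"
    shift
    case "$backend" in
        CUDA0|Vulkan0|Metal|CPU) ;;
        *) echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    # The variable is set only for this one command and its children.
    GGML_BACKEND="$backend" "$@"
}
```

From the examples directory, `with_backend Vulkan0 ./base.sh` would run the named speaker example on the Vulkan device, since the script inherits the variable from its environment.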

Quantization policy

Tokenizer GGUFs are not uniformly quantized. Three tensor categories get dedicated treatment:

tensor	dtype across all variants
RVQ codebooks, input_proj / output_proj, speaker encoder fc	F32
1D tensors (gamma, biases, norms, snake alpha and beta)	F32
Conv kernels with non-alignable rows (K=7,3,1)	F16 in Q* variants

Conv kernel rows (K=7, 3, 1) are never a multiple of a quantization block size (32 elements for Q8_0, 256 for K-quants), so the quantizer skips the intermediate Q* types and lands on F16 directly. This is the last-resort branch of llama.cpp's tensor_type_fallback, applied unconditionally to these kernels. F16 has no block-size constraint and matches the runtime target dtype on every backend.

The talker LM (Qwen3 backbone, hidden size divisible by 256) follows the standard llama.cpp K-quant scheme across variants. The code predictor MTP head and the speaker encoder live in the talker GGUF and share its quantization.
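
The row-alignment rule for conv kernels can be illustrated with a short shell sketch. It is not the project's converter: `block_size` and `pick_dtype` are hypothetical names, the block sizes are the standard GGML ones (32 for Q8_0, 256 for K-quants), and the always-F32 categories from the table above are not modelled.

```shell
# Sketch of the alignment rule only; not the project's conversion code.

# block_size VARIANT -> elements per quantization block
block_size() {
    case "$1" in
        Q8_0) echo 32 ;;
        Q4_K_M) echo 256 ;;   # K-quants use 256-element super-blocks
        *) echo 1 ;;          # F32 / BF16 / F16: no block constraint
    esac
}

# pick_dtype VARIANT ROW_LENGTH -> type used for a weight tensor's rows
pick_dtype() {
    bs=$(block_size "$1")
    if [ $(( $2 % bs )) -eq 0 ]; then
        echo "$1"             # row is alignable: keep the variant's quant
    else
        echo F16              # row is not alignable: go straight to F16
    fi
}
```

With these definitions, `pick_dtype Q8_0 7` yields F16 for a K=7 conv kernel, while `pick_dtype Q4_K_M 1024` keeps the variant's quantization because 1024 is a multiple of 256.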

License

Upstream model: Qwen3-TTS by Alibaba / Qwen team, Apache 2.0
Audio codec: Qwen3-TTS-Tokenizer-12Hz (Qwen team), Apache 2.0
GGUF tooling: qwentts.cpp, MIT
