<unused49> infinite generation after llama.cpp release b8699

#2 by fakezeta - opened

Hi,

after updating to release b8699, which supports attention rotation for heterogeneous iSWA (PR #21513), I'm getting infinite token generation.
I saw (https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/20) that Unsloth is updating their GGUFs.

Do you plan to update your quants?

Thank you in advance

Can you give an example of a command that's failing? I updated to the latest llama.cpp, ran the Q4_K_M I just downloaded, and have no issues.

Thank you for answering. I found out that the cause is V cache quantization to Q8_0.
I was using BF16 for the K cache and Q8_0 for the V cache, and after upgrading to b8699 that setup fails with your model and also with Unsloth's new ones.
Using BF16 for both caches works correctly as before, so it's probably a regression in llama.cpp.
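
For reference, here's a minimal sketch of the kind of invocation involved. The model file name and context size are illustrative placeholders; --cache-type-k and --cache-type-v are the llama.cpp flags for KV cache quantization, and note that a quantized V cache generally requires flash attention to be enabled.

```
# fails after b8699: BF16 K cache + Q8_0 V cache
# (model file and -c value are illustrative placeholders)
./llama-server -m model-Q4_K_M.gguf -c 8192 \
    --cache-type-k bf16 --cache-type-v q8_0

# works as before: BF16 for both caches
./llama-server -m model-Q4_K_M.gguf -c 8192 \
    --cache-type-k bf16 --cache-type-v bf16
```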

It would be great if we could figure this out; the cache quantization really helps with VRAM usage.

Sadly it would be an upstream issue, not related to the model quantization itself :(

Yeah fair enough.
