Qwen3.6-27B Uncensored Heretic v2

MLX 4-bit · Apple Silicon native

Text · Vision · Video · Thinking · Tool Calling

2026-04-30 — Re-converted from updated source. The upstream model (llmfan46) was re-done using MPOA (Magnitude-Preserving Orthogonal Ablation) replacing the earlier ARA method. Key improvements: refusals dropped from 13/100 to 6/100, KL divergence from 0.0035 to 0.0021, and reported issues with EOS spam and generation interruptions are fixed. If you downloaded before April 30, re-download for the better version.

Why this model?

Two things set this apart from other Qwen 3.6 conversions:

1. Architecture-aware uncensoring. Qwen 3.6 uses a hybrid attention design — linear (DeltaNet-style) and traditional softmax blocks, mixed 3:1. Most abliteration tools treat them the same. llmfan46 applied separate parameters for each attention type using the Heretic tool with the MPOA (Magnitude-Preserving Orthogonal Ablation) method, yielding one of the lowest KL divergences of any uncensored Qwen variant — dramatically fewer refusals with negligible capability loss.

2. A fixed chat template. The official Qwen 3.6 template is broken on every C++ runtime (LM Studio, llama.cpp, MLX). Tool calls crash, the developer role throws errors, and empty thinking blocks waste your context window. This model ships with a rewritten template that fixes all five issues and adds a thinking toggle (<|think_on|> / <|think_off|>) you can drop into any message.

Quick start

Text

from mlx_lm import load, generate

model, tokenizer = load("froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=0.7)
print(response)

Vision

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temp=0.7)
print(result.text)

CLI

# Text
mlx_lm.generate \
  --model froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit \
  --prompt "Hello"

# Vision
mlx_vlm.generate \
  --model froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit \
  --image image.jpg --prompt "Describe this image"

Requirements: mlx-lm >= 0.31.2, mlx-vlm >= 0.4.4

System prompt

The first line of your system prompt must be:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

The model underperforms without it. You can append anything after that line.

Thinking toggle

Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template intercepts the tag, strips it from context so the model never sees it, and flips the mode.

Fast answer, no reasoning:

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Deep reasoning:

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

Chat template fixes

The official Qwen 3.6 Jinja template has five bugs that break real usage. This model ships with a rewritten template that fixes all of them:

Bug	Impact	Fix
`	items` filter in tool calls	Crashes on every C++ runtime (LM Studio, llama.cpp, MLX)
`	safe` filter	Python-only, does not exist in C++ Jinja
`developer` role	Modern APIs send it; official template throws an error	Maps to `system`
Empty thinking blocks	Wraps every past turn in tags, even with nothing inside — wastes context tokens	Only emitted when `reasoning_content` is non-empty
`</thinking>` hallucination	Model sometimes generates the wrong closing tag; parser fails	Detects which tag was used and splits on that

Works in LM Studio, llama.cpp (--jinja), vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.

The uncensoring

This model uses Heretic v1.2.0 with the MPOA (Magnitude-Preserving Orthogonal Ablation) method.

How it works

Heretic identifies the "refusal direction" in the model's residual stream by comparing activations on harmless vs. harmful prompts, then orthogonalizes specific weight matrices against that direction so the model can no longer express refusal behavior.

MPOA preserves the norm of the original weight matrices during abliteration, maintaining the model's activation distributions and thus its capabilities — unlike simple orthogonal projection which can distort the activation landscape.

What llmfan46 did differently

Standard Heretic treats all attention blocks identically. Qwen 3.6's hybrid architecture mixes linear attention (DeltaNet-style) and traditional softmax attention in a 3:1 ratio. llmfan46 applied separate abliteration parameters for each attention type, allowing more precise removal of refusal behavior with less collateral damage to model capabilities.

This approach was submitted as a pull request to Heretic but was not merged — not because it doesn't work, but because the extra parameters increase optimization time. For this specific architecture, it produces superior results.

How it compares

Community results

r/LocalLLaMA users have been A/B-testing various uncensored Qwen 3.6 variants — Heretic, HauhauCS Aggressive, abliterix, and simple orthogonal projection. The pattern is consistent: Heretic produces the best balance of refusal removal and output quality.

Community discussion →

Why

Most abliteration methods treat all layers identically. Qwen 3.6's hybrid attention (3:1 linear-to-softmax ratio) means a single parameter set either under-abliterate the DeltaNet blocks or over-abliterate the softmax blocks. Architecture-aware abliteration — separate parameters per attention type — is the key differentiator.

A note on SSM conv1d "repair"

Some uncensored variants apply a pre-processing step that rescales SSM conv1d weights before abliteration, claiming to fix "outlier" tensors in the DeltaNet linear attention layers. This technique (originating as "Sig-ScaleSync") was benchmarked with 284 data points across perplexity, needle-in-a-haystack, and repetition tests at multiple context lengths (4K–128K). Result: perplexity degraded at every length with no improvement in NIAH or repetition. The unrepaired original weights perform best.

Abliterating a degraded baseline can yield a lower measured KL divergence — but that measures distance from a worse starting point, not better preservation of the original model's capabilities.

Sampling

From the official Qwen authors. Reserve 128K+ context for thinking mode.

Mode	temp	top_p	top_k	repeat_penalty	presence_penalty
Thinking (coding)	0.6	0.95	20	1.0	off
Thinking (general)	1.0	0.95	20	1.0	1.5
Non-thinking	0.7	0.8	20	1.0	1.5

GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).

This conversion


Source	llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (BF16 safetensors, MPOA abliteration)
Quantization	4-bit (4.6 bits/weight, ~15 GB across 3 shards)
Chat template	Fixed Jinja template with tool calling, developer role, thinking toggle, and hallucination handling
Minimum RAM	~20 GB (15 GB weights + overhead)

Architecture details

Spec	Value
Architecture	Dense — 27.8B params, all active per token
Layers	64 (3x linear attention + 1x full attention, 16 repetitions)
Attention	24 Q heads, 4 KV heads (GQA), head_dim 256
Linear attention	16 QK heads, 48 V heads, head_dim 128
FFN	intermediate_size 17408
Context	262K native, 1M+ with YaRN
RoPE	theta 10M, partial_rotary_factor 0.25, mrope_interleaved
Vocab	248K tokens
Multimodal	Text, image, video
Multi-token prediction	Supported (1 draft layer)
model_type	`qwen3_5`

Credits

Role	Author
Original model	Alibaba Cloud (Qwen team)
Refusal direction research	Arditi et al.
MPOA method	Jim Lai
Heretic tool	Philipp Weidmann
Architecture-aware abliteration + uncensored variant	llmfan46
Fixed chat template + MLX conversion	froggeric

License

Apache-2.0, inherited from Qwen3.6.

Downloads last month: 3,336

Safetensors

Model size

5B params

Tensor type

BF16

U32

MLX

Hardware compatibility

4-bit

Model tree for froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit

Base model

Qwen/Qwen3.6-27B

Finetuned

llmfan46/Qwen3.6-27B-uncensored-heretic-v2

Quantized

(15)

this model

Collection including froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit

Qwen 3.6 MLX — Fixed & Uncensored

Collection

MLX conversions of Qwen 3.6 with fixed chat templates, vision fixes, and thinking toggle. 4/6/8-bit. • 6 items • Updated 14 days ago

Paper for froggeric/Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit

Refusal in Language Models Is Mediated by a Single Direction

Paper • 2406.11717 • Published Jun 17, 2024 • 12

froggeric
/

Qwen3.6-27B-Uncensored-Heretic-v2-MLX-4bit