
https://unsloth.ai/docs/models/qwen3.6#mtp-guide
Unsloth wrote an MTP guide, and it has graphs comparing the results.

This preset (no MTP) does 18 tps on 2x R9700:

[Qwen3.6-27B-Q8_0-Code-256K]
m = /models/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf
mmproj = /models/Qwen3.6-27B/mmproj-BF16.gguf
chat-template-kwargs = {"preserve_thinking": true}
ctx-size = 262144
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0

This one, with MTP drafting enabled, does 39 tps on the same hardware:

[Qwen3.6-27B-MTP-Q8_0-Code-256K]
m = /models/Qwen3.6-27B-MTP/Qwen3.6-27B-Q8_0.gguf
mmproj = /models/Qwen3.6-27B-MTP/mmproj-BF16.gguf
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-kwargs = {"preserve_thinking": true}
ctx-size = 262144
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0

😱
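
For a rough feel of why drafting only 2 tokens can about double throughput, here is a toy model in Python. It assumes `spec-draft-n-max` caps how many tokens are drafted per step, that each drafted token is accepted independently with the same probability, and that drafting costs nothing. All three are simplifying assumptions, and the acceptance rates in the loop are made-up illustration values, not measurements from this setup.

```python
def expected_tokens_per_pass(accept_prob: float, draft_n_max: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Toy model: each drafted token is accepted independently with
    probability accept_prob, and drafting stops at the first rejection.
    The main model always contributes one token itself (k = 0 term),
    and the k-th drafted token survives with probability accept_prob**k.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_n_max < 0:
        raise ValueError("draft_n_max must be non-negative")
    return sum(accept_prob ** k for k in range(draft_n_max + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for p in (0.5, 0.7, 0.8, 0.9):
        tokens = expected_tokens_per_pass(p, draft_n_max=2)
        print(f"accept_prob={p:.1f} -> {tokens:.2f} tokens per pass")
```

With 2 drafted tokens the ceiling in this model is 3 tokens per pass, so a measured gain a bit above 2x is in the plausible range once drafting overhead is counted.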

Using MTP combined with tensor parallelism, I went from running Qwen3.6 27B at ~7 t/s to ~30 t/s on 3x RTX 2000e Ada, which I think is an insane boost.
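
To put the numbers reported in this thread side by side, a quick Python sketch of the speedup ratios. The labels are only descriptive, and the 7 and 30 t/s figures are approximate as posted, so treat the second ratio as a ballpark.

```python
def speedup(baseline_tps: float, boosted_tps: float) -> float:
    """Return how many times faster the boosted run is than the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return boosted_tps / baseline_tps


# (baseline tokens/s, boosted tokens/s) as reported in the thread.
reports = {
    "2x R9700, MTP": (18, 39),
    "3x RTX 2000e Ada, MTP + tensor parallelism": (7, 30),
}

if __name__ == "__main__":
    for label, (before, after) in reports.items():
        print(f"{label}: {before} -> {after} t/s = {speedup(before, after):.2f}x")
```

That works out to roughly 2.2x from MTP alone on the R9700 pair, and roughly 4.3x for MTP plus tensor parallelism on the three Ada cards.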