cicadagen

Did you follow a guide for setting up Speculative Decoding? I haven’t gotten it to work very well personally. Does the smaller model run on your CPU memory and the larger one fully on GPU?

No, I actually don't run a separate draft model on the CPU...

Since I'm using Qwen 3.6 27B, I'm utilizing MTP (Multi-Token Prediction) speculative decoding. This is built directly into the main model itself, so there is no extra "small model" to load or offload to the CPU. Everything runs entirely on the GPU VRAM.

To get it working well in llama.cpp, you just need two things:

  • Make sure you are using a model variant specifically trained for it (look for files with -MTP- in the name on Hugging Face, like the Unsloth ones).
  • Add the flag --spec-type draft-mtp to your startup command for Docker. That said, I do suggest compiling llama.cpp yourself now, for better KV caching.
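For anyone launching llama-server from a script rather than Docker, the two points above could be sketched roughly like this. The model filename is a made-up placeholder, the context size and layer count are arbitrary defaults, and --spec-type draft-mtp is copied from this comment rather than checked against a particular llama.cpp build, so confirm it with llama-server --help before relying on it.

```python
import shlex


def build_server_command(model_path, ctx_size=16384, gpu_layers=99):
    """Assemble a llama-server argv list (a sketch, not an official recipe)."""
    return [
        "llama-server",
        "-m", model_path,            # GGUF file; pick an -MTP- variant
        "-c", str(ctx_size),         # context window in tokens
        "-ngl", str(gpu_layers),     # number of layers to offload to the GPU
        "--spec-type", "draft-mtp",  # flag exactly as quoted in the comment above
    ]


# Hypothetical filename, used only for illustration.
cmd = build_server_command("Qwen-MTP-example.gguf")
print(shlex.join(cmd))
```

The list form can be passed straight to subprocess.run(cmd) without shell quoting issues; shlex.join is only used here to print a copy-pasteable line.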

That’s pretty much it! Because everything stays in VRAM and uses the main model's native architecture, the draft acceptance rate is super high (around 64% for me) and it basically doubles the generation speed.
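To see why an acceptance rate around 64% can roughly double generation speed, here is a back-of-the-envelope sketch. It assumes each drafted token is accepted independently with the same probability and that every verification pass yields one extra token from the main model; the draft lengths are guesses, since the thread does not say how many tokens MTP drafts per pass, and draft overhead is ignored.

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens emitted per verification pass.

    Simplifying assumptions (not measured from llama.cpp):
    - each drafted token is accepted independently with accept_prob
    - the main model always contributes one token per pass
      (the correction after a rejection, or a bonus token)
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return sum(accept_prob ** i for i in range(draft_len + 1))


# 0.64 is the acceptance rate reported above; draft lengths are hypothetical.
for draft_len in (1, 2, 3, 4):
    tokens = expected_tokens_per_pass(0.64, draft_len)
    print(f"draft_len={draft_len}: {tokens:.2f} tokens per pass")
```

Under these assumptions, two drafted tokens already give about 2.05 tokens per pass, which is in the same ballpark as the reported doubling.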

The KV cache for the MTP model is quantized all the way down to q4_0.

Following the other reply from brucethemoose@lemmy.world, I did try q8_0/q5_1. It works pretty well.
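For a rough feel of what those cache types buy you, here is a small estimate of KV cache size. The bytes-per-block figures follow the ggml block layouts (32 elements per block); the layer count, KV head count, and head dimension are hypothetical placeholders, not the real dimensions of any model in this thread, so plug in the values from your own GGUF metadata.

```python
# Bytes used to store one block of 32 elements for each ggml cache type.
BYTES_PER_32 = {"f16": 64, "q8_0": 34, "q5_1": 24, "q4_0": 18}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size in GiB for the given K and V cache types."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # elements in K (same for V)
    total_bytes = elems * (BYTES_PER_32[type_k] + BYTES_PER_32[type_v]) / 32
    return total_bytes / 2**30


# Hypothetical model dimensions, for illustration only.
dims = dict(n_ctx=32768, n_layers=64, n_kv_heads=8, head_dim=128)

for type_k, type_v in (("f16", "f16"), ("q8_0", "q5_1"), ("q4_0", "q4_0")):
    size = kv_cache_gib(type_k=type_k, type_v=type_v, **dims)
    print(f"K={type_k:5s} V={type_v:5s}: {size:.3f} GiB")
```

With these placeholder dimensions, q8_0/q5_1 comes out at a bit under half the f16 size, which is the kind of saving that makes a longer context fit in VRAM.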

I'll try Gemma-4-12B QAT when I get time :)

You are an absolute legend! You were 100% right.

I built llama.cpp from source just like you suggested, using the -DGGML_CUDA_FA_ALL_QUANTS=ON and AVX-512 flags. The difference is insane! The silent CPU fallback is completely gone. My prompt processing jumped from a slow 87 tok/s to 938.68 tok/s, and my CPU is now at 0% during prefill.
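To put those two prefill numbers in perspective, this is the plain arithmetic for how long a full prompt would take at each rate. Only the 87 and 938.68 tok/s figures come from this thread; the prompt sizes are example values, and real prefill speed usually varies with prompt length, so treat the output as a rough guide.

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a constant prefill rate."""
    return prompt_tokens / tokens_per_second


slow_rate = 87.0    # tok/s with the silent CPU fallback (from the comment above)
fast_rate = 938.68  # tok/s after rebuilding (from the comment above)

for prompt_tokens in (16384, 32768):
    slow = prefill_seconds(prompt_tokens, slow_rate)
    fast = prefill_seconds(prompt_tokens, fast_rate)
    print(f"{prompt_tokens} tokens: {slow:.0f} s before, {fast:.0f} s after")

print(f"speedup: {fast_rate / slow_rate:.1f}x")
```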

P.S. I was using the Docker image previously...

Thank you so much for the compile flag. You saved me a ton of time. Oh also, avx512 is "1" by default.

Edit: And I am at 32k context now :)

32G :(

9950X3D.

How much RAM you got with that?

…And have you ever considered running an MoE, with experts on the CPU?

They can be shockingly fast, as the attention and dense layers are all still processed on the GPU. You could also run a much less impactful quantization, especially for the dense layers (which are typically at Q6K or Q8_0 for MoEs, while the experts in CPU RAM can take heavier quantization).
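As a rough illustration of that split, the sketch below estimates how much weight memory would land in VRAM versus system RAM. The bits-per-weight values follow the ggml block layouts for those quant types; the parameter counts are invented placeholders rather than any real model, and real GGUF files mix quant types and also need room for the KV cache and activations.

```python
# Effective bits per weight for a few ggml quantization types.
BITS_PER_WEIGHT = {"q8_0": 8.5, "q6_k": 6.5625, "q4_k": 4.5}


def weights_gib(params_billion, quant):
    """Approximate size in GiB of params_billion weights at the given quant."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 2**30


def moe_split(dense_b, expert_b, dense_quant, expert_quant):
    """Return (vram_gib, ram_gib) with dense/attention on GPU, experts on CPU."""
    return weights_gib(dense_b, dense_quant), weights_gib(expert_b, expert_quant)


# Hypothetical MoE: 3B dense/attention parameters, 27B expert parameters.
vram, ram = moe_split(dense_b=3, expert_b=27, dense_quant="q6_k", expert_quant="q4_k")
print(f"GPU (dense, q6_k): {vram:.1f} GiB")
print(f"CPU (experts, q4_k): {ram:.1f} GiB")
```

In a layout like this the GPU only holds a few GiB of weights, leaving the rest of VRAM for context, while the experts have to fit in system RAM.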

I have 32G rn and plan to upgrade to 64G soon, once it gets cheaper again. Meanwhile, I did some testing:

q8_0 + q5_1

I just tried it, but it silently falls back to the CPU, giving me 87 tok/s for prompt processing. Maybe q5_1 is not fully supported on AMD? I cannot find anything relevant :(

Plug your monitor into the motherboard

My mobo has just one HDMI port and one USB4 DP port. I would need to buy a new cable... I'll try it once I get one.

P.S. 16k context window is not good for agentic coding. I am trying to increase it and see what I can do...

Meme Checkpoint

25d 18h ago in memes

How do you assume that it was only 2?

Does anybody know?

4mon 2d ago in lemmyshitpost from sh.itjust.works

I love how it just says no, prolly no life left XD

get out of my head

5mon 24d ago in lemmyshitpost from lemmy.ml

You gotta try hard enough

Questions on Light takes all paths

1y 1mon ago in askscience