Qwen 3.6 27B running at 46 tok/s on an RX 9070 XT (llama.cpp + MTP Speculative Decoding is basically magic)



3d 19h ago by ani.social/u/cicadagen in localllama@sh.itjust.works

Just got my hands on a new AMD Radeon RX 9070 XT (16GB) and wanted to share some inference numbers. I've been messing around with llama.cpp via their official ROCm Docker image, testing out Qwen 3.6 27B (Omnimerge-v4) in IQ3_M.

Honestly, the performance you can squeeze out of a 27B model on a 16GB consumer card right now is blowing my mind.

Here’s the breakdown:

The Setup

GPU: AMD Radeon RX 9070 XT (RDNA4 / gfx1201) - 16 GB VRAM
CPU: AMD Ryzen 9 9950X3D
OS/Backend: Linux via Docker using ghcr.io/ggml-org/llama.cpp:server-rocm. (Props to the devs, it natively supports RDNA4 gfx1201 out of the box!)
Model: Qwen3.6-27B-Omnimerge-v4-IQ3_M.gguf (~13 GB)
Context: 16k

Tweaks:

Set -np 1 since I'm just running it as a single-user chatbot in Open WebUI.
Slapped on 8-bit KV cache (--cache-type-k q8_0 --cache-type-v q8_0), which roughly halves the KV cache's VRAM footprint compared to the default f16.
Enabled MTP Speculative Decoding (--spec-type draft-mtp).
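
For a rough sense of where the "about half" figure for the q8_0 KV cache comes from, here is a minimal sketch. It assumes ggml's standard q8_0 block layout (32 int8 values plus one fp16 scale per block); the function and variable names are mine, not llama.cpp APIs. Note the saving applies to the KV cache only, not to the model weights or compute buffers.

```python
# Per-element storage cost of an f16 vs. a q8_0 KV cache.
# Assumption: ggml's standard q8_0 block layout. Names are illustrative only.

F16_BYTES_PER_ELEM = 2.0  # one IEEE half-precision float per cached value


def q8_0_bytes_per_elem() -> float:
    block_elems = 32   # values stored per q8_0 block
    scale_bytes = 2    # one fp16 scale shared by the block
    quant_bytes = 32   # 32 int8 quantized values
    return (scale_bytes + quant_bytes) / block_elems


ratio = q8_0_bytes_per_elem() / F16_BYTES_PER_ELEM
savings_pct = (1.0 - ratio) * 100.0
print(f"q8_0 cache is {ratio:.1%} the size of f16 ({savings_pct:.1f}% saved)")
```

So the saving is a little under 50%, because each block carries its own scale.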

The Numbers (512-token test)

Prompt Processing: 549.27 tok/s (1220 tokens evaluated in ~2.2s, so time to first token is about 2.2s). Latency: 1.82 ms/token.
Text Gen: 46.06 tok/s (512 tokens generated in ~11.1s). Latency: 21.71 ms/token.
MTP Stats: Draft acceptance rate was super high at 62.7% (333 draft tokens accepted / 531 generated). Aggregate speed for the logged request (prompt + generated tokens over total time) hit roughly 48.97 tok/s.
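
The figures above can be cross-checked with a few lines of arithmetic. This sketch only uses numbers quoted in this post and its server log (eval time, total time, and draft counts); the helper names are made up for illustration. One caveat: the aggregate figure is taken from the logged request, which evaluated 42 prompt tokens rather than 1220.

```python
# Cross-check the reported throughput numbers from the quoted timings.

def tok_per_s(tokens: int, seconds: float) -> float:
    return tokens / seconds


def ms_per_token(tokens: int, seconds: float) -> float:
    return 1000.0 * seconds / tokens


gen_tokens, gen_seconds = 512, 11.11624  # "eval time" from the server log
prompt_tokens_logged = 42                # prompt tokens in the logged request
total_seconds = 11.31212                 # "total time" from the server log

gen_speed = tok_per_s(gen_tokens, gen_seconds)
gen_latency = ms_per_token(gen_tokens, gen_seconds)
aggregate = tok_per_s(gen_tokens + prompt_tokens_logged, total_seconds)
acceptance = 333 / 531                   # accepted / generated draft tokens

print(f"generation: {gen_speed:.2f} tok/s, {gen_latency:.2f} ms/token")
print(f"aggregate:  {aggregate:.2f} tok/s")
print(f"acceptance: {acceptance:.1%}")
```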

Memory Footprint

VRAM: Sitting at 14.46 GB out of 16 GB. This leaves about 1.5 GB of breathing room, which has been totally stable with zero OOM crashes so far.
System RAM: ~4.3 GB (mostly the host-side prompt cache helping speed up subsequent turns).

RDNA4 is ready: The latest ROCm images and llama.cpp HIP libs support the 9070 XT natively. Didn't even need to mess with HSA_OVERRIDE_GFX_VERSION.

KV Cache quantization is required: Pushing the KV cache to q8_0 is the only reason a 16k context window fits on a 16GB card alongside a 27B model.

If anyone with a 16GB card is looking for the sweet spot, this has to be one of the best price-to-performance setups available right now. Let me know if you want my docker-compose.yml or run scripts...

Edit: To see how much MTP actually helps, I ran the exact same 512-token prompt test with speculative decoding toggled off:

Standard Inference (No Speculation):

Speed: 25.88 tokens/sec (Latency: 38.64 ms/token)
Prompt Eval: 239.46 tokens/sec

MTP Speculative Decoding:

Speed: 46.06 tokens/sec (Latency: 21.71 ms/token)
Prompt Eval: 549.27 tokens/sec
Draft Acceptance Rate: 62.7%
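
To put the two runs side by side, here is a small sketch that turns the generation speeds above into a speedup factor and wall-clock time for the 512-token test. Only the two tok/s values come from the benchmark; everything else is derived.

```python
# Compare generation speed with and without MTP speculative decoding.

baseline_tok_s = 25.88  # standard inference, no speculation
mtp_tok_s = 46.06       # MTP speculative decoding

speedup = mtp_tok_s / baseline_tok_s
time_saved_pct = (1.0 - baseline_tok_s / mtp_tok_s) * 100.0

n_tokens = 512
baseline_seconds = n_tokens / baseline_tok_s
mtp_seconds = n_tokens / mtp_tok_s

print(f"speedup: {speedup:.2f}x ({time_saved_pct:.1f}% less time per token)")
print(f"{n_tokens} tokens: {baseline_seconds:.1f}s -> {mtp_seconds:.1f}s")
```

That works out to a bit under 1.8x, so "close to double" is a fair summary for this prompt. The gain depends on the draft acceptance rate, so other prompts may land higher or lower.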

For more info, here are my logs

Logs

ROCm GPU and System Info:

device_info:

    - ROCm0   : AMD Radeon RX 9070 XT (16304 MiB, 16200 MiB free)
    - ROCm1   : AMD Ryzen 9 9950X3D 16-Core Processor (15589 MiB, 25012 MiB free)
    - CPU     : AMD Ryzen 9 9950X3D 16-Core Processor (31178 MiB, 31178 MiB free)

 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1

Context Initialization (16k single-slot & 8-bit KV Cache):

  I srv    load_model: [spec] estimated memory usage of MTP context is 178.02 MiB
  W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
  I srv    load_model: initializing slots, n_slots = 1
  I slot   load_model: id  0 | task -1 | new slot, n_ctx = 16384
  I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
  I srv  llama_server: model loaded
  I srv  llama_server: server is listening on http://0.0.0.0:8080/

Benchmark Request Timing Statistics (Generating 512 tokens):

  I slot launch_slot_: id  0 | task 3454 | processing task, is_child = 0
  I slot print_timing: id  0 | task 3454 | n_decoded =    103, tg =  37.93 t/s
  I slot print_timing: id  0 | task 3454 | n_decoded =    247, tg =  43.16 t/s
  I slot print_timing: id  0 | task 3454 | n_decoded =    392, tg =  44.87 t/s
  I slot print_timing: id  0 | task 3454 | prompt eval time =     195.88 ms /    42 tokens (    4.66 ms per token,   214.41 tokens per second)
  I slot print_timing: id  0 | task 3454 |        eval time =   11116.24 ms /   512 tokens (   21.71 ms per token,    46.06 tokens per second)
  I slot print_timing: id  0 | task 3454 |       total time =   11312.12 ms /   554 tokens
  I slot print_timing: id  0 | task 3454 |    graphs reused =       3560
  I slot print_timing: id  0 | task 3454 | draft acceptance = 0.62712 (  333 accepted /   531 generated)
  I statistics        draft-mtp: #calls(b,g,a) =    6   3607   3607, #gen drafts =   3607, #acc drafts =  2824, #gen tokens =  10820, #acc tokens =  6605, dur(b,g,a) = 0.008, 31925.089, 2.740 ms
  I slot      release: id  0 | task 3454 | stop processing: n_tokens = 553, truncated = 0
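
If you want to pull these numbers out of your own server logs instead of reading them by eye, something like the following should work. It is only a sketch: the regular expressions are modeled on the print_timing lines quoted above, and the log format may differ between llama.cpp versions. The function name and dictionary keys are my own.

```python
import re

# Patterns modeled on the print_timing lines quoted in the log above.
TIMING_RE = re.compile(
    r"(?P<kind>prompt eval|eval) time =\s*(?P<ms>[\d.]+) ms /\s*(?P<tokens>\d+) tokens"
)
ACCEPT_RE = re.compile(
    r"draft acceptance = (?P<rate>[\d.]+) \(\s*(?P<acc>\d+) accepted /\s*(?P<gen>\d+) generated\)"
)


def parse_timings(log_text: str) -> dict:
    """Return prompt/generation tok/s and draft acceptance found in log_text."""
    stats = {}
    for line in log_text.splitlines():
        timing = TIMING_RE.search(line)
        if timing:
            key = "prompt_tok_s" if timing["kind"] == "prompt eval" else "gen_tok_s"
            stats[key] = int(timing["tokens"]) / (float(timing["ms"]) / 1000.0)
        accept = ACCEPT_RE.search(line)
        if accept:
            stats["accept_rate"] = int(accept["acc"]) / int(accept["gen"])
    return stats


SAMPLE = """\
I slot print_timing: id  0 | task 3454 | prompt eval time =     195.88 ms /    42 tokens (    4.66 ms per token,   214.41 tokens per second)
I slot print_timing: id  0 | task 3454 |        eval time =   11116.24 ms /   512 tokens (   21.71 ms per token,    46.06 tokens per second)
I slot print_timing: id  0 | task 3454 | draft acceptance = 0.62712 (  333 accepted /   531 generated)
"""

for name, value in parse_timings(SAMPLE).items():
    print(f"{name}: {value:.3f}")
```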

P.S. A 16k context window is too small for agentic coding. I'm trying to increase it and will see how far I can get...

This is what I was gonna mention. Even with a 24 GB 4090 I have no room for context. I run it as an 8-bit quant on a DGX Spark for that. It’s only 13 tps, but with 256k context.

Try q8_0 + q5_1 cache. The V cache is much less sensitive to quantization.

Also, use that IGP on your 9950X3D! Plug your monitor into the motherboard, and free up vram on the 9070, so you can use every last megabyte.

q8_0 + q5_1

I just tried it but it is silently falling back to CPU for processing giving me 87 tok/s for prompt processing. Maybe q5_1 is not fully supported on AMD? I cannot find anything relevant :(

Plug your monitor into the motherboard

My mobo has just one HDMI port and a USB4 DP output, so I would need to buy a new cable... I'll try it once I get one.

I dunno where you got your llama.cpp binary from, but all the fa kernel “combinations” need to be compiled, and maybe q8_0/q5_1 isn’t compiled by default?

If you compile it yourself, there may be an “all_quants” flag or something similar you have to enable.

As an aside, be sure to enable the AVX-512 flags as well. Ryzen 9000 benefits from them quite a bit.

You are an absolute legend! You were 100% right.

I built llama.cpp from source just like you suggested, using the -DGGML_CUDA_FA_ALL_QUANTS=ON and AVX-512 flags. The difference is insane! The silent CPU fallback is completely gone. My prompt processing jumped from a slow 87 tok/s to 938.68 tok/s, and my CPU is now at 0% during prefill.

P.S. I was using the Docker image previously...

Thank you so much for the compile flag. You saved me a ton of time. Oh also, avx512 is "1" by default.

Edit: And I am at 32k context now :)
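
A back-of-the-envelope look at why q8_0/q5_1 makes room for 32k: the sketch below compares relative KV cache cost for different K/V type combinations and context lengths. It assumes ggml's standard block sizes (bytes per 32-element block), that K and V hold the same number of elements, and that the cache grows linearly with context. It ignores compute buffers and everything model-specific, so treat the output as ratios, not megabytes. The helper name is made up.

```python
# Relative KV cache cost for different K/V cache types and context sizes.
# Assumption: ggml's standard block layouts, 32 elements per block.

BLOCK_ELEMS = 32
BLOCK_BYTES = {
    "f16": 64,   # 32 x 2-byte halves
    "q8_0": 34,  # fp16 scale + 32 int8 quants
    "q5_1": 24,  # fp16 scale + fp16 min + 4 bytes high bits + 16 bytes low nibbles
    "q4_0": 18,  # fp16 scale + 16 bytes of 4-bit quants
}


def kv_cost(k_type: str, v_type: str, n_ctx: int) -> float:
    """Bytes per (one K element + one V element), scaled by context length."""
    per_elem = (BLOCK_BYTES[k_type] + BLOCK_BYTES[v_type]) / BLOCK_ELEMS
    return per_elem * n_ctx


old = kv_cost("q8_0", "q8_0", 16 * 1024)      # original 16k setup
new = kv_cost("q8_0", "q5_1", 32 * 1024)      # 32k setup after the rebuild
f16_16k = kv_cost("f16", "f16", 16 * 1024)    # unquantized 16k, for reference

print(f"32k q8_0/q5_1 vs 16k q8_0/q8_0: {new / old:.2f}x")
print(f"32k q8_0/q5_1 vs 16k f16/f16:   {new / f16_16k:.2f}x")
```

Under these assumptions, doubling the context with the cheaper V cache costs roughly 1.7x the cache memory rather than 2x, and still comes in below an unquantized 16k cache.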

I checked, and the CMAKE flag you want to enable Q8/Q5 is:

https://github.com/ggml-org/llama.cpp/blob/1fd6dfe9f3d4b69cce101d832339fbda2d14b056/ggml/CMakeLists.txt#L208

And all these AVX512 ones:

https://github.com/ggml-org/llama.cpp/blob/1fd6dfe9f3d4b69cce101d832339fbda2d14b056/ggml/CMakeLists.txt#L159

MTP is not always the best choice. Without it, you'll get faster prefill and free up a good chunk of memory.

Or at least quantise KV-cache for the MTP-model all the way down to q4_0.

Qwen3.6-27B is probably the strongest model you can run on just one consumer graphics card. Gemma-4-12B QAT is quite nice if you want a bit of parallelism and huge context. Needs a custom chat template to use tools correctly.

KV-cache for the MTP-model all the way down to q4_0.

From the other reply from brucethemoose@lemmy.world, I did try q8_0/q5_1. It works pretty well.

I'll try Gemma-4-12B QAT when I get time :)

Yeah.

If you really want speed, 12B QAT is a good one for vLLM, right?

https://huggingface.co/google/gemma-4-12B-it-qat-w4a16-ct/tree/main

Though admittedly I have no idea if vLLM works on 9070s these days, and you will get less context.

9950X3D.

How much RAM you got with that?

…And have you ever considered running an MoE, with experts on the CPU?

They can be shockingly fast, as the attention and dense layers are all still processed on the GPU. You could also run a much less impactful quantization, especially for the dense layers (which are typically at Q6_K or Q8_0 for MoEs, while the experts in CPU RAM can take heavier quantization).

Yeah a 16G 9070 tells me Qwen 3.6 35B A3B. There's space for the model and context.


I have 32G rn, planning to upgrade it to 64G soon once they get cheaper again. Meanwhile, I did some testing

Or potentially a ~120B model if OP has 64GB RAM.

32G :(

Did you follow a guide for setting up Speculative Decoding? I haven't gotten it to work very well personally. Does the smaller model run on your CPU memory and the larger one fully on GPU?


No, I actually don't run a separate draft model on the CPU...

Since I'm using Qwen 3.6 27B, I'm utilizing MTP (Multi-Token Prediction) speculative decoding. This is built directly into the main model itself, so there is no extra "small model" to load or offload to the CPU. Everything runs entirely on the GPU VRAM.

To get it working well in llama.cpp, you just need two things:

Make sure you are using a model variant specifically trained for it (look for files with -MTP- in the name on Hugging Face, like the Unsloth ones).
Add the flag --spec-type draft-mtp to your startup command (I was using Docker). That said, I now suggest compiling llama.cpp yourself, so that mixed KV cache quantization like q8_0/q5_1 runs on the GPU instead of falling back to the CPU.

That’s pretty much it! Because everything stays in VRAM and uses the main model's native architecture, the draft acceptance rate is super high (around 64% for me) and it basically doubles the generation speed.
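
To make the "why it's faster" part concrete, here is a rough estimate of how many tokens each verification pass of the main model produced in my benchmark. The counts come from the log lines in the post. The rest is my own inference and should be read as an assumption: the per-request number of verification passes isn't logged, so I estimate it from the average draft length in the cumulative statistics line, and I assume each pass emits the accepted draft tokens plus one token from the main model.

```python
# Estimate tokens produced per main-model verification pass from the logged counts.

# Cumulative "statistics draft-mtp" line: drafts and draft tokens generated.
gen_drafts, gen_draft_tokens = 3607, 10820
draft_len = gen_draft_tokens / gen_drafts  # average tokens proposed per draft

# Per-request print_timing lines for the 512-token benchmark.
req_draft_tokens, req_accepted, req_decoded = 531, 333, 512

# Assumption: one verification pass per draft, each pass also yields one
# token from the main model in addition to the accepted draft tokens.
est_passes = req_draft_tokens / draft_len
est_decoded = req_accepted + est_passes
tokens_per_pass = req_decoded / est_passes

print(f"average draft length:   {draft_len:.2f} tokens")
print(f"estimated passes:       {est_passes:.0f}")
print(f"estimated decoded:      {est_decoded:.0f} (logged: {req_decoded})")
print(f"tokens per verify pass: {tokens_per_pass:.2f}")
```

The estimate lands within a couple of tokens of the logged 512, which suggests the assumption is about right. Nearly three tokens per pass does not translate into a 3x speedup, since drafting and verifying several tokens per pass both cost extra time, which fits the measured speedup of just under 1.8x.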