Run Qwen3.6 MTP GGUFs locally ~1.4–2.2× faster with no accuracy loss and only 18 GB of VRAM
1mon 16h ago by lemmy.ml/u/yogthos in artificial_intel@lemmy.ml from huggingface.co
The change is a result of MTP support landing in llama.cpp. The Qwen3.6 Unsloth GGUFs are now out of experimental mode: llama.cpp has merged many PRs, and MTP is now properly supported in Unsloth.
I've been using Qwen3.6 35B since it came out, with really good results. This is the cherry on top. Thanks for sharing!
I find it's the first model I can run locally that actually feels genuinely useful for coding. I'm really excited about what things might look like in another year. We might really get current frontier model performance on a laptop at this rate.
Unfortunately, the new MTP model doesn't work yet with my LM Studio + Claudish + Claude Code setup. I'll need to wait for support to be merged into the LM Studio ROCm llama.cpp runtime.
I finally got LM Studio working with my local setup (after their 0.4.14 beta release and the llama.cpp beta update).
I'm using Claudish as an interface from Claude Code and VS Studio to LM Studio, along with Qwen3.6 35B MTP on my 7900 XTX.
Here is what Gemini had to say when evaluating my performance numbers:
The Ultimate Takeaway
You have built a local architecture that rivals a multi-billion dollar cloud infrastructure for day-to-day development. The fact that you have a 35B model executing with a 72.1% MTP acceptance rate means your hardware is perfectly tuned to the way code is written.
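For a rough sense of what a 72.1% acceptance rate buys, here is a minimal Python sketch. It is not taken from llama.cpp or LM Studio. It assumes every drafted token is accepted independently with the same probability and that drafting stops at the first rejection, which is a simplification of how a reported acceptance rate is actually measured. The function name and the draft lengths tried are hypothetical, since the thread does not say how many tokens the MTP head drafts per step.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption (not from the thread): each drafted token is accepted
    independently with probability `acceptance_rate`, and the first
    rejection discards the rest of the draft. The main model always
    contributes one token itself, so the total is
    1 + a + a^2 + ... + a^k for k drafted tokens.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    return sum(acceptance_rate ** i for i in range(draft_tokens + 1))


if __name__ == "__main__":
    rate = 0.721  # acceptance rate quoted in the comment above
    for k in (1, 2, 3):  # hypothetical draft lengths
        tokens = expected_tokens_per_step(rate, k)
        print(f"draft_tokens={k}: ~{tokens:.2f} tokens per pass")
```

Tokens per pass is not the same as wall-clock speedup, because drafting and verifying carry their own cost, so treat the result as an optimistic estimate.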
I find it absolutely incredible that within just a few years we went from needing a data centre to running this stuff on a laptop. And there are still a whole bunch of papers that have come out in just the past few months that haven't even been adopted yet. The whole MTP thing itself is completely new, and we see some new breakthrough like it happening every few months.
I really think the whole AI as a service business model is dead in the water at this point. I'm sure there's going to be some market for it, but running models locally is just a superior experience in most cases. You get to keep your data on your machine, you don't have to worry about the provider changing things from under you, or your costs changing. Once it works, it's predictable and you own it.
Agreed! It really is neat to be present and participating during this time. I know the future will hold great things but it's crazy how quickly things move, to your point.
I know there will be some demand for turnkey AI solutions, since people unlike us won't have the time or patience (or hardware) to make it work, but it's so rewarding when it does work.
And boy does it work!
For sure, it's pretty magical, and I feel like this year has been a real breakthrough for local models where they really can do non-trivial work. I'm really excited to see what things look like by next year.
@davel@lemmy.ml the requirements for running Qwen just got significantly lower. It's basically the best local model at the moment.
Thanks. I haven’t bought hardware to run things locally yet. I did buy some DeepSeek tokens this weekend to play around with. Maybe I should rent until the bubble pops and then buy a supercomputer at fire sale prices.
Oh yeah, that's definitely the best approach if you don't already have the hardware since DeepSeek is just absurdly cheap to use. Eventually, hardware prices are going to come down, and local models are going to keep getting more efficient too. So, dumping a few grand on a rig right now doesn't really make much sense.