Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

1mon 2d ago by mander.xyz/u/BB84 in localllama@sh.itjust.works from github.com

Crossposted from https://lemmy.ml/post/47429470

My oversimplified and possibly wrong understanding: this is like speculative decoding, but instead of a separate draft model (which does its own prompt processing), they use some diffusion thing strapped on top of the main model. The diffusion reuses the high-quality prompt processing result of the main model.
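For anyone who hasn't seen speculative decoding before, here is a minimal toy sketch of the generic draft-then-verify loop in its greedy form. It is not Orthrus' actual method: the diffusion drafter and the reuse of the main model's prompt processing are not modeled. `target_next`, `draft_next`, and the integer "tokens" are hypothetical stand-ins for real models. The point it shows is that every emitted token is one the main model would have picked anyway, so the output is unchanged while the number of main-model passes drops.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # hypothetical: context -> greedy next token


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-then-verify. Returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The cheap drafter proposes k tokens.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))

        # 2. The target checks all k positions. A real model does this in one
        #    batched forward pass; this toy loops but counts it as one pass.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's correction; stop here
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            accepted.append(target_next(out + draft))
        out.extend(accepted)
    return out[: len(prompt) + max_new_tokens], passes


# Toy models: the target counts upward mod 10; the drafter is wrong after a 2.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, [0], 12, k=4)
baseline = greedy_decode(toy_target, [0], 12)
print(tokens == baseline, passes)
```

With sampling instead of greedy decoding, the usual trick is an accept/reject step that keeps the target's output distribution intact, which is presumably what "identical output distribution" in the title refers to.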

The 7.8× claim sounds almost too good to be true. But even if we only get something like 3×, this is still a huge revolution in localLLMing.
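One caveat on reading the headline number: the title says tokens per forward pass, which is not the same as wall-clock speedup if each pass gets more expensive. A back-of-the-envelope sketch, assuming a very simple cost model where the only overhead is a per-pass slowdown factor (`relative_step_cost` is a hypothetical parameter, and the 1.5 below is a made-up illustration, not a measured figure):

```python
def estimated_speedup(tokens_per_forward: float, relative_step_cost: float) -> float:
    """Rough wall-clock speedup over plain one-token-per-pass decoding.

    tokens_per_forward: average tokens emitted per main-model pass.
    relative_step_cost: cost of one such pass (drafting included) relative
        to a plain decoding pass; 1.0 means no overhead at all.
    """
    if relative_step_cost <= 0:
        raise ValueError("relative_step_cost must be positive")
    return tokens_per_forward / relative_step_cost


# Best case from the title, with zero overhead assumed.
print(estimated_speedup(7.8, 1.0))
# A more modest 3 tokens/pass with a made-up 50% overhead per pass.
print(estimated_speedup(3.0, 1.5))
```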

I was just about to make a post asking for the best small model after finding out Qwen3-27B was way too slow, so Orthrus-Qwen3-8B looks like a pretty appealing option.

They said they're working on Orthrus for Qwen3.5. It'll be amazing!

Yeah, unfortunately it seems this can't be converted to a llama.cpp-compatible format yet, and that's a pretty big tradeoff right now. Not surprising given how new it is, but we'll have to wait to combine it with other improvements. Pretty exciting for the future though.

Update: I actually couldn't get this to run even on Hugging Face Transformers. I filed a bug report, but basically I'm hitting some torch incompatibilities with flash-attn. Maybe this is a known issue for more experienced folks, but I couldn't solve it.
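In case it helps anyone hitting the same thing: flash-attn is a compiled extension, so a build made against one torch/CUDA combination often fails under another. A small sketch for collecting the installed versions to paste into a bug report. It only uses the standard library, and the package list is my assumption about what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "flash-attn", "transformers")):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Transformers also accepts `attn_implementation="sdpa"` in `from_pretrained`, which avoids flash-attn for models that support it. Whether Orthrus' code does is not something I can confirm.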