Speculative Decoding Is Back on X Again

Speculative decoding is one of those ideas that sounds like a trick until you realise it is mostly just good systems engineering.

Putting the hype aside for a second, the problem is very simple: large language models are slow at generation because they usually generate one token at a time. Token 50 depends on token 49, token 49 depends on token 48, and so on. So even if you have an absurd amount of GPU or TPU compute sitting there, the core loop is still annoyingly sequential.

Speculative decoding asks a very natural question:

What if a cheap model guessed the next few tokens, and the expensive model only checked whether those guesses were valid?

If the guesses are right, you get multiple output tokens for roughly one expensive target-model step. If the guesses are wrong, you reject the bad guess and fall back to the target model. The clever part is that this can be done without changing the final output distribution, assuming the exact rejection-sampling version is implemented properly.

That last sentence is the whole reason the technique matters. This is not “make the model worse but faster.” It is “use extra cheap compute to reduce the number of expensive serial steps.”

[Interactive figure: One step of decoding. Per step, autoregressive decoding emits 1 token and speculative decoding emits up to 4.]

Let’s break down exactly how it works.

The actual bottleneck

A standard decoder-only LLM generates text autoregressively. Given a prefix, it predicts a probability distribution for the next token, samples or selects one token, appends it to the context, and repeats.

The loop looks like this:

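while not is_done(prefix):
    logits = target_model(prefix)    # one full forward pass of the large model
    next_token = sample(logits)      # argmax or temperature sampling
    prefix.append(next_token)        # the next iteration depends on this token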

This is brutally simple, but it has a structural issue: generating K new tokens requires K serial calls to the large model. That is exactly the bottleneck the original speculative decoding paper targeted.

This also creates a weird hardware mismatch. Modern accelerators are very good at parallel matrix multiplication, but token-by-token decoding often becomes memory-bandwidth bound. You keep loading huge model weights just to produce one more token. The processor is capable of much more parallel work than the decoding loop is giving it.
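A rough back-of-the-envelope version of that mismatch, as a sketch: at batch size 1, each decode step has to stream roughly all of the weights from memory once, so weight bytes divided by memory bandwidth gives a floor on step time. The model size, precision, and bandwidth below are illustrative assumptions, not measurements of any particular system, and the estimate ignores KV-cache reads and multi-device sharding.

```python
def min_decode_step_seconds(num_params: float, bytes_per_param: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to read every weight once."""
    return num_params * bytes_per_param / mem_bandwidth_bytes_per_s


# Hypothetical numbers, chosen only to make the arithmetic concrete.
NUM_PARAMS = 70e9        # assumed 70B-parameter model
BYTES_PER_PARAM = 2      # assumed 16-bit weights
BANDWIDTH = 2e12         # assumed 2 TB/s of memory bandwidth

step = min_decode_step_seconds(NUM_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"floor per token: {step * 1e3:.0f} ms, ceiling: {1 / step:.1f} tokens/s")
```

Under these assumed numbers the floor is about 70 ms per token, however much arithmetic throughput sits idle. Verifying several tokens per weight read is how speculative decoding tries to use that slack.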

So the opportunity is obvious: if we can turn some of those serial target-model calls into parallel verification work, we get latency back.

The core idea

Speculative decoding uses two distributions:

  • p: the large target model we actually want to sample from
  • q: a cheaper draft model, draft head, n-gram heuristic, or some other fast proposer

The draft model generates a short candidate continuation:

prefix: The capital of France is
q proposes: Paris . The

Then the target model scores those proposed tokens in parallel.

This sounds impossible at first because LLMs are autoregressive. But the trick is that verification is different from generation. Once the candidate tokens are already known, the target model can be run in a teacher-forced mode over the whole proposed block. It computes:

$$
\begin{aligned}
&P(\text{“Paris”} \mid \text{prefix}) \\
&P(\text{“.”} \mid \text{prefix} + \text{“Paris”}) \\
&P(\text{“The”} \mid \text{prefix} + \text{“Paris .”})
\end{aligned}
$$

in a single target-model forward pass over the candidate sequence.

The model is still respecting causal order. We are not letting position 3 see the future. We are just evaluating many already-proposed positions in parallel, which accelerators are good at.

In the best case, the target accepts the whole draft block and also produces one extra “bonus” token. So instead of getting one token from one expensive target pass, we get K + 1 tokens.
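To make the "verification is different from generation" point concrete, here is a toy sketch. The "target model" is a hypothetical bigram table (the next-token distribution depends only on the previous token), which is far simpler than a real Transformer, and all names are made up for illustration. It shows that once the draft tokens are known, the K + 1 distributions can be read off in one batched operation and match what K + 1 sequential steps would have produced.

```python
import numpy as np

VOCAB_SIZE = 5
_rng = np.random.default_rng(0)
_logits = _rng.normal(size=(VOCAB_SIZE, VOCAB_SIZE))
# Hypothetical toy "target model": a bigram table. Row t is the next-token
# distribution given that the previous token was t.
BIGRAM = np.exp(_logits) / np.exp(_logits).sum(axis=1, keepdims=True)


def next_token_probs(context):
    """One 'decode step': distribution over the next token."""
    return BIGRAM[context[-1]]


def score_sequential(prefix, draft_tokens):
    """k + 1 separate steps, feeding the draft tokens in one at a time."""
    context = list(prefix)
    rows = []
    for token in draft_tokens:
        rows.append(next_token_probs(context))
        context.append(token)
    rows.append(next_token_probs(context))  # position after the last draft token
    return np.stack(rows)


def score_block(prefix, draft_tokens):
    """The same k + 1 distributions from one batched lookup over known tokens."""
    previous_tokens = np.array([prefix[-1], *draft_tokens])
    return BIGRAM[previous_tokens]


prefix, draft_tokens = [0, 3], [1, 4, 2]
block = score_block(prefix, draft_tokens)
assert block.shape == (len(draft_tokens) + 1, VOCAB_SIZE)
assert np.allclose(block, score_sequential(prefix, draft_tokens))
```

In a real Transformer the same effect comes from the causal attention mask: every position in the block is computed in one pass, but each one only attends to the tokens before it.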

[Interactive figure: Draft → Verify → Accept / Reject. Rows: prefix ("The capital of France is"), draft (q), target (p), output. Legend: draft proposal, target output, accepted, rejected, wasted, fallback, +1 bonus token.]

The simple greedy version

For greedy decoding, where temperature is basically zero and the model always picks the highest-probability token, the intuition is very easy.

  1. The draft model proposes K tokens.
  2. The target model computes what it would have picked at each of those positions.
  3. We accept draft tokens until the first mismatch.
  4. At the mismatch, we use the target model’s token and continue.

Example:

q proposes: Paris . It
p says: Paris . The
accepted: Paris .
rejected: It
fallback: The

This already gives the main intuition: the cheap model runs ahead, and the large model acts like a verifier.
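The greedy rule above can be sketched in a few lines. The function name and the list-of-strings token representation are assumptions for illustration. The target's choice at each position is computed with the draft tokens before it as context, so those choices are only meaningful up to and including the first mismatch.

```python
def greedy_verify(draft_tokens, target_tokens):
    """Accept draft tokens up to the first mismatch with the target's argmax.

    target_tokens has len(draft_tokens) + 1 entries: the target's greedy
    choice at each draft position, plus its choice for the position after
    the final draft token (the bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected one target token per draft position, plus one")

    output = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            output.append(wanted)      # fallback: the target's own token
            return output
        output.append(drafted)         # accepted
    output.append(target_tokens[-1])   # everything matched: bonus token
    return output


# The last target entry is a made-up bonus token; it is unused here because
# the third draft token is rejected.
print(greedy_verify(["Paris", ".", "It"], ["Paris", ".", "The", "capital"]))
# ['Paris', '.', 'The']
```

Either way the step emits at least one token the target itself would have chosen, so the output matches plain greedy decoding with the target model.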

But sampling is harder. If the target model is sampling from probabilities instead of always taking argmax, then simply accepting exact matches would bias the distribution. This is where the actual speculative decoding algorithm becomes interesting.

The exact sampling trick

For stochastic decoding, the algorithm uses a modified rejection-sampling step.

Assume the draft model proposes token $x_i$ at position $i$. The draft probability for that token is $q_i(x_i)$. The target probability for the same token is $p_i(x_i)$.

We accept the token with probability:

$$\min\!\left(1,\; \frac{p_i(x_i)}{q_i(x_i)}\right)$$

If the target model likes the token at least as much as the draft model does, we always accept it. If the draft model overestimated that token, we only accept it sometimes.

If we reject, we do not just sample normally from $p$. We sample from the corrected leftover distribution:

$$\operatorname{normalize}\!\bigl(\max(p_i - q_i,\, 0)\bigr)$$

That correction is what preserves the target model’s distribution. The accepted path contributes the probability mass where draft and target overlap. The rejection path fills in the remaining probability mass where the target wanted tokens more than the draft did.

In pseudocode, the shape is:

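import numpy as np

rng = np.random.default_rng()

def speculative_step(prefix, target, draft, k):
    # 1. Draft k candidate tokens quickly.
    #    draft_probs[i] is the full draft distribution at position i.
    draft_tokens, draft_probs = draft.sample(prefix, k)

    # 2. Target scores all draft positions in one parallel pass.
    #    target_probs has k + 1 rows: one per draft position, plus the
    #    distribution for the position after the last draft token.
    target_probs = target.score(prefix, draft_tokens)

    accepted = []

    for i, token in enumerate(draft_tokens):
        p = target_probs[i][token]
        q = draft_probs[i][token]  # > 0, because the draft sampled this token

        # Accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            accepted.append(token)
            continue

        # 3. Rejection: sample from the corrected residual distribution.
        residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
        residual = residual / residual.sum()
        fallback_token = int(rng.choice(len(residual), p=residual))
        return accepted + [fallback_token]

    # 4. If every draft token was accepted, emit the bonus target token.
    bonus = int(rng.choice(len(target_probs[k]), p=target_probs[k]))
    return accepted + [bonus]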

That is speculative decoding in its cleanest form.

The draft model proposes. The target model verifies. Acceptance preserves as much cheap work as possible. Rejection corrects the distribution so that the final output is still distributed like the target model.
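The distribution-preservation claim can be checked numerically for a single position. The sketch below adds up the two ways a token can be emitted (proposed and accepted, or resampled after a rejection) and confirms the total equals the target distribution. The function name and the example distributions are illustrative assumptions.

```python
import numpy as np


def output_distribution(p, q):
    """Exact distribution of the token emitted at one position.

    p: target distribution, q: draft distribution (1-D arrays summing to 1).
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Path 1: the draft proposes x (probability q[x]) and it is accepted
    # (probability min(1, p[x] / q[x])), so the joint mass is min(p[x], q[x]).
    accepted_mass = np.minimum(p, q)
    # Path 2: the proposal is rejected, then we resample from the residual.
    reject_prob = 1.0 - accepted_mass.sum()
    residual = np.maximum(p - q, 0.0)
    if residual.sum() > 0:
        residual = residual / residual.sum()
    return accepted_mass + reject_prob * residual


p = np.array([0.6, 0.3, 0.1])   # illustrative target distribution
q = np.array([0.2, 0.7, 0.1])   # illustrative draft distribution
assert np.allclose(output_distribution(p, q), p)
print("per-position acceptance probability:", np.minimum(p, q).sum())
```

The acceptance probability at a position is the overlap between the two distributions, the sum of min(p, q) over tokens. That is why a draft that tracks the target closely is worth more than one that is merely fast.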

This is why the original paper and DeepMind’s follow-up keep describing the technique as “lossless.” They do not mean it magically produces the same exact text under every implementation detail. They mean the algorithm can preserve the target distribution without changing the model weights or intentionally lowering model quality.

Why this can be faster

The speedup comes from a trade: spend cheap compute to avoid expensive serial target-model calls.

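Let’s say the draft model proposes 5 tokens. The target verifies them in one forward pass. If the first 4 are accepted and the fifth is rejected, you still emit 5 tokens (the 4 accepted drafts plus the target’s corrected replacement) while paying for only one expensive target-model pass, plus the cheap drafting work.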

[Interactive figure: Speedup vs acceptance rate. Assumes draft cost ≈ 0.15 target step per drafted token; real systems vary. Sample reading: 2.94 expected tokens per target step (6 max), 1.68× faster than autoregressive.]

$$E[\text{accept}] = \frac{1 - \alpha^{K+1}}{1 - \alpha}$$

How much speedup you actually get depends on a tangle of moving parts. The dominant one is the acceptance rate: when the draft predicts tokens the target also likes, you win, and when it keeps missing, you have just added overhead. The draft itself also has to be meaningfully cheaper than the target, otherwise the “optimization” turns into a new bottleneck. Speculation depth cuts both ways — too short and you leave easy speedup on the table, too long and you waste verification and drafting work past the first rejection. On top of that, the verification pass has to actually use the parallel compute that hardware exposes, which is why GPU and TPU implementation details end up mattering as much as the algorithm. Traffic pattern then shapes the rest: low-QPS, latency-sensitive, memory-bound workloads are the classic fit, though newer work is starting to challenge the idea that speculative decoding only helps at low concurrency.
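A small calculator makes the acceptance-rate and depth trade-off easier to poke at. It uses the expected-token formula quoted with the speedup figure above, under a deliberately simple cost model that is an assumption here: each token is accepted independently with probability alpha, a verification pass costs the same as one ordinary target step, and each drafted token costs `draft_cost` target steps. The values alpha = 0.7 and K = 5 are also assumptions; they happen to reproduce the figure's 2.94 and 1.68×.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per target step: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Modelled speedup over plain autoregressive decoding.

    One speculative step costs one target pass plus k draft passes, each
    draft pass costing `draft_cost` target passes.
    """
    return expected_tokens(alpha, k) / (1.0 + k * draft_cost)


def best_depth(alpha: float, draft_cost: float, max_k: int = 16) -> int:
    """Depth k in 1..max_k with the highest modelled speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, draft_cost))


print(round(expected_tokens(0.7, 5), 2))   # 2.94
print(round(speedup(0.7, 5, 0.15), 2))     # 1.68
print(best_depth(0.7, 0.15))               # 3
```

Under this toy model, depth 3 slightly beats depth 5 at alpha = 0.7, and at alpha = 0.3 the best depth drops to 1. Real systems will not follow the model exactly, but it shows why a longer draft is not automatically better.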

This is also why benchmark claims vary so much. The algorithm is not a magic 3x button. It is a systems optimization whose gains depend on the target model, draft method, prompt distribution, sampling settings, batch size, and hardware.

Where it works well

Speculative decoding works best when the next tokens are relatively predictable.

This includes:

  • code completion
  • boilerplate text
  • structured responses
  • low-temperature chat
  • long completions where many local continuations are obvious
  • assistant workflows where inter-token latency actually matters

In these cases, a cheap draft model can often predict the target model’s next few tokens with high acceptance.

The classic example is language that has strong local structure:

Actions speak louder than ___

The draft model probably guesses “words.” The target model probably agrees. We should not burn a full large-model decode step just to discover the obvious.

Where it breaks down

There are also cases where speculative decoding is not worth it.

High-temperature sampling hurts acceptance because the target distribution is intentionally more random. Bad draft models hurt because they propose continuations the target rejects. Very high-throughput serving can reduce the latency gap because batching already keeps the target model busy. Some structured-output or constrained-decoding setups also complicate verification.

There is also an implementation caveat. “No quality loss” only applies to exact speculative decoding. Some production systems use relaxed acceptance rules, quantized probabilities, custom kernels, or approximate draft heads. Those may be perfectly reasonable engineering decisions, but they should not be confused with the clean theoretical guarantee.

So the practical rule is:

Speculative decoding is only good when accepted tokens are cheaper than serial target-model tokens.

Obvious, but easy to forget when speedup charts start trending.

The variants you keep seeing

The original version used a smaller draft model. Since then, the field has turned “drafting” into a whole design space.

A few important variants:

Draft-model speculation

The straightforward version. Use a smaller model to propose tokens and the larger model to verify them.

N-gram and suffix speculation

No trained draft model. The system guesses based on repeated patterns in the prompt or generated text. This is lightweight and easy to enable, but usually less powerful.
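A minimal sketch of the n-gram idea, assuming the simplest possible matching rule: look for the most recent earlier occurrence of the last few tokens and propose whatever followed it. The function name and parameters are hypothetical and are not any serving engine's actual API. The proposals still go through normal target verification, so a bad guess only costs wasted work.

```python
def ngram_propose(tokens, ngram_size=3, num_draft=4):
    """Propose draft tokens by matching the trailing n-gram earlier in `tokens`.

    Returns up to `num_draft` tokens that followed the most recent earlier
    occurrence of the last `ngram_size` tokens, or [] if there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan start positions right to left, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + num_draft]
    return []


context = ["x", "=", "foo", "(", "a", ")", ";", "y", "=", "foo", "("]
print(ngram_propose(context, ngram_size=2, num_draft=3))
# ['a', ')', ';']
```

This is why the approach does well on code and other repetitive text: the prompt often already contains the continuation.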

Medusa-style and multi-token heads

Instead of running a separate draft model, attach extra heads to the model that predict future tokens. This reduces overhead but requires model-specific training or support.

EAGLE-style speculation

EAGLE uses the target model’s internal features to produce better drafts. It is one of the main modern families because draft quality matters more than just draft speed.

MTP drafters

Multi-token prediction drafters are becoming more productised. Google’s Gemma 4 announcement, for example, released MTP drafters for the Gemma 4 family and reported up to 3x speedups in supported stacks without degradation in output quality or reasoning logic.

Parallel drafting

A hidden problem in some speculative methods is that the draft model itself may still generate draft tokens sequentially. If drafting 5 tokens requires 5 draft passes, the draft can become the new bottleneck. P-EAGLE attacks this by generating all K draft tokens in a single forward pass, with AWS reporting up to 1.69x speedup over vanilla EAGLE-3 on real workloads on NVIDIA B200.

Diffusion-style drafting

This is the more aggressive version: generate a whole block of candidate tokens in one shot. Google and UCSD recently highlighted DFlash on TPU v5p, reporting a 3.13x average tokens-per-second increase and peak speedups near 6x on complex math tasks.

Speculative speculative decoding

Yes, the name is ridiculous. The idea is not.

Normal speculative decoding still has a sequential dependency:

  1. draft
  2. verify
  3. wait for verification result
  4. draft again

Speculative speculative decoding asks: why should the drafter wait?

While the target model is verifying the current draft, the draft side predicts the likely verification outcomes and prepares future speculations ahead of time. If the actual verification result is one of the predicted outcomes, the next draft is already ready. Kumar, Dao, and May call their optimized algorithm Saguaro, and report that it is on average 30% faster than the strongest speculative decoding baselines and up to 5x faster than autoregressive decoding with open-source inference engines.

That is the kind of result that gets the inference crowd posting again.

So why is it picking up steam again?

There are a few reasons, and none of them are random.

First, inference cost is becoming impossible to ignore. Training gets the headlines, but serving is where products bleed money every day. A 20% serving-cost improvement matters at scale. Red Hat recently benchmarked Eagle3 speculative decoding with gpt-oss-120B in vLLM and reported a 20.5% output-throughput improvement on SWE-bench, plus a 19.4% reduction in cost per 1M output tokens at peak utilisation. That is not just a paper number. That is an infra bill number.

Second, agents and voice interfaces make latency visible. Nobody cares if a batch job finishes slightly faster in the background. But with coding agents, browser agents, terminal agents, voice assistants, and interactive copilots, inter-token latency directly affects how “alive” the product feels.

Third, the serving stacks are finally mature enough. vLLM now documents multiple speculative methods including EAGLE, MTP, draft models, PARD, MLP, n-gram, and suffix decoding. Hugging Face has assisted decoding. SGLang and TensorRT-LLM have their own speculative paths. This has moved from “cool paper” to “thing you can actually try without writing the whole inference engine yourself.”

Fourth, the new work is attacking the second-order bottlenecks. The original idea was “draft then verify.” The 2026 discussion is more like:

  • can we make drafting parallel?
  • can we hide drafting latency completely?
  • can we use target hidden states to improve acceptance?
  • can we train draft heads that ship with the model?
  • can we make this work under high concurrency?
  • can we map it cleanly to B200s, H200s, TPUs, and consumer GPUs?

That is a much more interesting engineering frontier.

Finally, optimizations with a simple story and a big number attached travel fast. “Same model, same output distribution, 2x faster” is just extremely postable. The recent Speculative Speculative Decoding announcement from Tanishq Kumar claimed SSD was up to 2x faster than the strongest inference engines, which is exactly the kind of sentence that spreads in ML systems circles.

Does it actually matter?

Somewhat, with caveats.

I would not treat any single speculative decoding benchmark as universally transferable. A result on Llama with one draft model on H100s does not automatically tell you what happens with Qwen, DeepSeek, Gemma, GPT-OSS, or your weird internal fine-tune under production traffic.

But I do think the direction is real.

Google Research’s retrospective says speculative decoding has been used across Google products and was part of optimizations for AI Overviews, while the original work showed roughly 2x-3x improvements on translation and summarisation tasks. DeepMind’s speculative sampling paper reported 2x-2.5x decoding speedups with Chinchilla 70B in a distributed setup. More recent production-oriented work is reporting smaller but economically meaningful gains, like Red Hat’s 19.4% cost reduction result, while newer research variants — parallel drafting, diffusion-style speculation, and Saguaro — are chasing much larger speedups by removing draft bottlenecks entirely.

The correct read is not “speculative decoding always gives 3x.”

The correct read is:

The autoregressive decode loop is structurally inefficient, and speculative methods are becoming one of the main ways inference stacks attack that inefficiency.

That matters because inference is turning into a competitive layer. The model weights are only one part of the product. The serving engine, cache layout, batching policy, quantization scheme, draft model, verifier kernel, and hardware mapping all affect user experience and cost.

Where this probably goes

I think speculative decoding becomes boring infrastructure.

Not boring as in unimportant. Boring as in eventually expected.

KV caching used to be something you had to explain. Now it is table stakes. Continuous batching used to sound specialised. Now every serious inference stack has a story around it. Speculative decoding is likely going the same direction.

Models will increasingly ship with their own draft heads or companion drafters. Serving engines will auto-select speculation depth based on acceptance rates and load. Hardware-specific kernels will matter more. And the “best” inference setup will be less about one trick and more about how all of these optimizations compose.

The interesting part is that speculative decoding changes how we think about model deployment. We are not only asking “which model is smartest?” anymore. We are asking:

  • how much of its next-token work is predictable?
  • can cheaper compute run ahead of expensive compute?
  • can verification be batched or overlapped?
  • can we trade unused parallelism for lower latency?
  • can we reduce cost without touching model quality?

That is why the topic is back.

Not because speculative decoding is new. It is not. The main papers are from 2022 and 2023.

It is back because the ecosystem around it finally looks usable, the economics are getting painful, and the new variants are attacking the bottlenecks inside the bottleneck.

Frankly, that is the part I find exciting. The low-hanging fruit in LLM products is no longer just prompting a better model. It is building systems that make expensive intelligence feel cheap and instant.

Speculative decoding is one of the cleaner examples of that shift.