DFlash: Pushing Gemma 4 to Warp Speed

> When Google shipped native multi-token prediction in Gemma 4, the community celebrated. Then DFlash dropped — achieving up to 6x lossless acceleration with block diffusion speculative decoding. Here's how it works and why it matters.

When Google shipped native multi-token prediction in Gemma 4, the community celebrated. Then DFlash dropped — and turned that celebration into a full-blown acceleration party.

What Just Happened?

Google's Gemma 4 family landed this week with some serious firepower:

  • 26B-A4B MoE (4B active params)
  • 31B Dense
  • E2B and E4B on-device variants
  • Native Multi-Token Prediction (MTP) — finally baked into the architecture

MTP is a game-changer because it attacks the fundamental bottleneck of autoregressive decoding: generating tokens one at a time. Instead of emitting a single token per forward pass, the target model can predict and verify multiple future tokens in parallel. That's a 2-3x speedup right out of the box.
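
To make that concrete, here's a minimal sketch of greedy speculative verification, the mechanism that keeps these speedups lossless: the target scores an entire proposed block in one forward pass and keeps only the prefix it would have generated itself. The function names (verify_block, target_argmax_fn) are illustrative, not from Gemma 4 or DFlash code.

python
# Minimal sketch of greedy speculative verification (illustrative names, not the DFlash API).
# target_argmax_fn(tokens) runs ONE forward pass of the target model and returns, for every
# position, the token the target would generate next at that position.

def verify_block(prefix, draft_block, target_argmax_fn):
    """Keep the longest prefix of draft_block that the target itself would have produced."""
    preds = target_argmax_fn(prefix + draft_block)      # single parallel pass over prefix + draft
    accepted = []
    for i, drafted in enumerate(draft_block):
        target_choice = preds[len(prefix) - 1 + i]      # target's prediction for this slot
        if drafted == target_choice:
            accepted.append(drafted)                    # draft matches: accept and keep going
        else:
            accepted.append(target_choice)              # first mismatch: take the target's token
            break                                       # and discard the rest of the block
    else:
        accepted.append(preds[-1])                      # whole block accepted: one bonus token
    return accepted

Every emitted token is one the target model would have produced on its own, which is why the acceleration costs nothing in output quality.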

But here's the thing: MTP is just the beginning.

Enter DFlash: Block Diffusion for Flash Speculative Decoding

DFlash, an open-source project from Z-Lab, just released dedicated draft models for both Gemma-4-26B-A4B-it and Gemma-4-31B-it. And the numbers are insane.

The Core Innovation

While EAGLE-3 (the current SOTA speculative decoding method) still drafts autoregressively — meaning its draft model generates tokens one by one — DFlash uses a lightweight block diffusion model to draft an entire block of tokens in a single parallel forward pass.

Think of it this way:

  • EAGLE-3: "Let me guess the next token... ok, now the next one... now the next one..."
  • DFlash: "Here's all 16 tokens at once. Verify them, boss."
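
In rough pseudocode, the difference in the drafting step looks like this (hypothetical function names; the real DFlash drafter is a block diffusion model, conditioned as described in the next section):

python
# Illustrative contrast between the two drafting styles; names are hypothetical.

def draft_autoregressive(draft_step_fn, prefix, k=16):
    """EAGLE-3 style: k sequential calls to the draft model, one token per call."""
    tokens = []
    for _ in range(k):
        tokens.append(draft_step_fn(prefix + tokens))   # each step waits on the previous one
    return tokens

def draft_block_diffusion(draft_block_fn, prefix, k=16):
    """DFlash style: one parallel call that produces all k tokens of the block at once."""
    return draft_block_fn(prefix, block_size=k)         # single forward pass for the whole block

Either way, the drafted block then goes to the target model for verification, so output quality is identical; only the cost of producing the draft changes.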

The Secret Sauce: Target Conditioning

A naive diffusion drafter would struggle because it lacks the target model's reasoning. DFlash solves this elegantly by extracting hidden features from the target model and injecting them into every layer of the draft model's KV cache. This means:

  • The drafter borrows the target's deep reasoning
  • Every draft layer gets full context (not just the first layer)
  • Acceptance rates scale with model depth — not diminish

The draft model reuses the target's embedding and LM head. Only the intermediate layers are trained. Minimal parameters. Maximum speed.
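
Here's a rough PyTorch-style sketch of that wiring, based only on the description above (the module names, per-layer projection, and concatenation trick are my assumptions, not the released Z-Lab code): the target's hidden features are projected and prepended to every draft layer's attention context, while the embedding and LM head are frozen modules borrowed from the target.

python
import torch
import torch.nn as nn

class DFlashStyleDrafter(nn.Module):
    """Sketch of a target-conditioned block drafter (illustrative, not the Z-Lab implementation)."""

    def __init__(self, target_embed, target_lm_head, d_model, n_layers=4, n_heads=8):
        super().__init__()
        # Embedding and LM head are reused from the target model and kept frozen.
        self.embed = target_embed
        self.lm_head = target_lm_head
        for module in (self.embed, self.lm_head):
            for p in module.parameters():
                p.requires_grad = False
        # Only these intermediate layers (and the projections below) are trained.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        # One projection per draft layer, so every layer receives the target's hidden features.
        self.feature_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])

    def forward(self, block_ids, target_hidden):
        # block_ids:     (batch, block)          token ids of the draft block being denoised
        # target_hidden: (batch, ctx, d_model)   hidden features extracted from the target model
        h = self.embed(block_ids)
        for layer, proj in zip(self.layers, self.feature_proj):
            ctx = proj(target_hidden)                   # inject target features at THIS layer
            h = layer(torch.cat([ctx, h], dim=1))       # block positions attend over the context
            h = h[:, ctx.size(1):]                      # keep only the block positions
        return self.lm_head(h)                          # logits for all block positions at once

In the real model this runs inside a block diffusion loop; the point of the sketch is the conditioning pattern: every layer sees the target's features, and only the intermediate layers and projections carry trainable parameters.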

Benchmarks: Lossless 6x Acceleration

On Qwen3-8B (a comparable architecture), DFlash achieves up to 6.17x lossless speedup — that's 2.5x faster than EAGLE-3:

Benchmark    EAGLE-3 Speedup    DFlash Speedup
GSM8K        2.13x              5.20x
MATH-500     2.18x              6.17x
AIME24       2.25x              5.91x
AIME25       2.18x              5.85x
HumanEval    2.48x              5.20x
MBPP         2.27x              4.75x

And it maintains strong performance under sampling (temperature=1) and with reasoning/thinking mode enabled (~4.5x for reasoning models).

Gemma 4 + DFlash = Production Ready

What makes this particularly exciting for Gemma 4 users:

  1. vLLM Integration — vLLM v0.20.1+ includes core DFlash support, with a temporary Docker build for Gemma 4:

    bash
    docker pull ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130
  2. SGLang Production Serving — Full speculative algorithm support with DFLASH flag

  3. MLX for Apple Silicon — Tested on M5 Pro, supports Gemma-4 models

  4. Hugging Face Transformers — For fast experimentation

Quick Start with Gemma 4 on vLLM

bash
docker run --rm -it \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/z-lab/vllm-openai:gemma4-dflash-cu130 \
  google/gemma-4-26B-A4B-it \
  --host 0.0.0.0 \
  --port 8000 \
  --speculative-config '{"method": "dflash", "model": "z-lab/gemma-4-26B-A4B-it-DFlash", "num_speculative_tokens": 15, "attention_backend": "flash_attn"}' \
  --attention-backend triton_attn \
  --max-num-batched-tokens 32768 \
  --trust-remote-code
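
Once the server is up, requests go through vLLM's standard OpenAI-compatible endpoint; the speculative decoding happens entirely server-side, so clients don't change at all. A minimal example (the model name matches the one passed to the server above, and the api_key is a placeholder since vLLM doesn't check it by default):

python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; DFlash is transparent to the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)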

Why This Matters for the Open Source Community

DFlash represents a fundamental shift in how we think about inference optimization:

  • No quality loss — Speculative verification guarantees the target model's output quality
  • No massive drafters — Unlike DiffuSpec or SpecDiff-2 that use 7B-parameter drafters, DFlash stays lightweight
  • Universal applicability — Works across model families (Gemma 4, Qwen3.5, Kimi K2.5, GPT-OSS, and more)
  • Production ready — Native integration with the two most popular serving frameworks

The Z-Lab team is also planning to open-source the training recipe, meaning you'll be able to train your own DFlash draft models for any LLM.

The Bottom Line

Google built MTP into Gemma 4's architecture. That's a 2-3x win. DFlash takes that foundation and pushes it to 6x — all while staying open source, lightweight, and compatible with your existing serving infrastructure.

If you're running Gemma 4 in production, DFlash isn't just a nice-to-have. It's the difference between "fast" and "warp speed."


Resources: