Changes from previous version
First Gemma variant using text diffusion rather than token-by-token autoregression. Trades some output quality for much faster parallel decoding. Bi-directional context also helps with tasks such as Sudoku and code infilling, which strictly left-to-right models struggle with.
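To make the contrast with token-by-token decoding concrete, here is a minimal toy sketch of confidence-based iterative unmasking, a generic masked-diffusion-style scheme. It is not DiffusionGemma's actual sampler: `toy_denoiser`, `TARGET`, and the unmasking schedule are hypothetical stand-ins chosen only to show how several positions can be filled per model pass while using context on both sides.

```python
import math
from typing import Callable, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet
Block = List[Optional[str]]
Prediction = Tuple[str, float]  # (proposed token, confidence)

# Hypothetical "ground truth" the toy denoiser always proposes.
TARGET = "the quick brown fox jumps over the lazy dog".split()


def toy_denoiser(block: Block) -> List[Prediction]:
    """Stand-in for a real model (assumption, not a library API).

    Proposes a token for every position and is more confident when the
    left and/or right neighbour is already known (bi-directional context).
    """
    preds = []
    for i in range(len(block)):
        left_known = i > 0 and block[i - 1] is not MASK
        right_known = i + 1 < len(block) and block[i + 1] is not MASK
        confidence = 0.2 + 0.4 * left_known + 0.4 * right_known
        preds.append((TARGET[i], confidence))
    return preds


def decode_block(
    block: Block,
    denoise: Callable[[Block], List[Prediction]],
    steps: int,
) -> Tuple[Block, int]:
    """Fill all masked positions in at most `steps` model passes."""
    block = list(block)
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if not masked:
            break
        preds = denoise(block)  # one pass scores every position at once
        passes += 1
        # Commit an even share of the remaining masks, most confident first.
        k = math.ceil(len(masked) / (steps - step))
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            block[i] = preds[i][0]
    return block, passes


# Infilling: known tokens on both sides, six gaps in between.
prompt: Block = ["the", MASK, MASK, MASK, "jumps", MASK, MASK, MASK, "dog"]
filled, passes = decode_block(prompt, toy_denoiser, steps=3)
print(" ".join(filled), f"({passes} passes for 6 masked tokens)")
```

In this toy run the six gaps are filled in three passes, where a strictly sequential decoder would need one pass per token. A real model would replace `toy_denoiser` with a learned network, and its schedule and confidence measure may differ from the ones assumed here.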
Release Summary
Experimental open Gemma model that generates text via diffusion instead of autoregressive decoding. A 26B-parameter MoE (3.8B active) built on Gemma 4 and Gemini Diffusion research, released under Apache 2.0. Delivers up to 4x faster token generation on dedicated GPUs (1000+ tok/s on H100, 700+ tok/s on RTX 5090) by drafting 256-token blocks in parallel with bi-directional attention. Fits in 18 GB of VRAM when quantized. Best for speed-critical local workflows such as in-line editing, code infilling, and rapid iteration; standard Gemma 4 remains the recommended choice for maximum output quality.
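As a rough sanity check on the 18 GB figure, the sketch below estimates weight storage for a 26B-parameter model at a few bit widths. The bit widths are assumptions (the summary does not state which quantization format is used), and the estimate ignores activations, caches, and per-format overhead. It also assumes all 26B parameters must be resident in memory, as is typical for MoE models even though only 3.8B are active per token.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9   # total parameter count, from the release summary
VRAM_BUDGET_GB = 18   # quantized footprint, from the release summary

for bits in (16, 8, 4):  # assumed bit widths, not stated in the source
    gb = weight_memory_gb(TOTAL_PARAMS, bits)
    fits = gb < VRAM_BUDGET_GB
    print(f"{bits:>2}-bit weights: {gb:5.1f} GB, under {VRAM_BUDGET_GB} GB: {fits}")
```

Under these assumptions only the 4-bit case (about 13 GB of weights) leaves headroom within 18 GB, which is consistent with the quantized figure quoted above but does not confirm the actual format.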
Timeline
DiffusionGemma released
Google releases DiffusionGemma as an experimental open model on Hugging Face with MLX, vLLM, and Transformers integrations.