DiffusionGemma is Google's fastest AI yet, but it comes with a big trade-off
Google has released DiffusionGemma, an experimental AI model that takes a very different approach to how most chatbots generate text today. Instead of writing one word after another in a strict sequence, it generates a whole block of text at once and then keeps refining it until it becomes readable. The idea is to push for speed and hardware efficiency, even if it means giving up some polish in the final output.
This new AI model is open-sourced under the Apache 2.0 license and is aimed at developers and researchers rather than everyday users. To understand why this matters, it helps to look at how most large language models work. Systems like Google's Gemma 4 generate text step by step, one token at a time. Each new word depends on what came before it, which makes the process inherently sequential and harder to speed up.
DiffusionGemma, on the other hand, starts with a full canvas of random tokens, essentially noisy, unreadable text, and then repeatedly cleans it up in multiple passes. With each pass, the output becomes more structured and coherent until it settles into a final response. A simple way to picture it is that traditional models write, while DiffusionGemma drafts and edits everything at once.
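To make the draft-and-edit picture concrete, here is a toy Python sketch of iterative parallel refinement. It is not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in that already knows the answer, and the schedule that locks in a growing share of positions on each pass is an assumption chosen purely for illustration.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def toy_denoiser(canvas, target):
    """Hypothetical stand-in for a trained model.

    It proposes a token for every position in one call. Here it simply
    returns the target; a real model would predict from the noisy canvas.
    """
    return list(target)

def refine(target, passes=4, seed=0):
    """Start from random tokens and clean the whole block up over several passes."""
    rng = random.Random(seed)
    n = len(target)
    canvas = [rng.choice(VOCAB) for _ in range(n)]  # noisy, unreadable start
    fixed = [False] * n
    history = ["".join(canvas)]
    for step in range(1, passes + 1):
        proposal = toy_denoiser(canvas, target)  # all positions proposed at once
        # Assumed schedule: commit a growing share of positions on each pass.
        goal = round(n * step / passes)
        open_positions = [i for i in range(n) if not fixed[i]]
        rng.shuffle(open_positions)
        for i in open_positions[: goal - sum(fixed)]:
            canvas[i] = proposal[i]
            fixed[i] = True
        history.append("".join(canvas))
    return history

if __name__ == "__main__":
    for snapshot in refine("diffusion drafts then edits"):
        print(repr(snapshot))
```

Each printed snapshot is the full block after one pass: gibberish at first, then progressively closer to the final text, with the number of passes fixed in advance rather than tied to the length of the output.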
That shift has a direct impact on performance. Per Google's claims, DiffusionGemma can be up to four times faster than standard autoregressive models in low-concurrency scenarios, where a single user or process uses the GPU. On high-end hardware, the numbers are even more aggressive. The company asserts more than 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on an RTX 5090.
Under the hood, DiffusionGemma is a 26-billion-parameter Mixture-of-Experts model, but it does not activate all of that at once. Only about 3.8 billion parameters are used during inference, helping keep compute requirements manageable. Google says this makes it possible to run the model on high-end consumer GPUs when quantized, with a memory footprint of around 18GB VRAM.
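A quick back-of-the-envelope check shows what those figures imply. The Python sketch below uses only the numbers quoted above and assumes 1GB means 10^9 bytes. The bits-per-parameter value is a rough upper bound on weight precision, since the quoted footprint may also cover things other than weights.

```python
TOTAL_PARAMS = 26e9    # total parameters in the Mixture-of-Experts model
ACTIVE_PARAMS = 3.8e9  # parameters used during inference
VRAM_BYTES = 18e9      # quoted quantized footprint, assuming 1 GB = 10^9 bytes

# Share of the model that is active during inference.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Average storage budget per parameter if all 18GB went to weights.
bits_per_param = VRAM_BYTES * 8 / TOTAL_PARAMS

# For comparison: the same weights stored at 16 bits (2 bytes) each.
fp16_bytes = TOTAL_PARAMS * 2

if __name__ == "__main__":
    print(f"active share: {active_fraction:.1%}")          # 14.6%
    print(f"bits per parameter: {bits_per_param:.2f}")     # 5.54
    print(f"16-bit footprint: {fp16_bytes / 1e9:.0f} GB")  # 52 GB
```

In other words, roughly one parameter in seven is active during inference, and fitting into about 18GB means the weights average well under 16 bits each, which is why quantization is a precondition for running the model on consumer GPUs.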
Where things get more interesting is how the model actually generates text. It can produce up to 256 tokens in parallel in a single step, and each token can attend to every other token in the block. That gives the model a global view of the output instead of a strictly linear one.
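The difference between a linear view and a global one can be sketched as an attention mask, a grid that records which positions each token is allowed to look at. The Python below is illustrative only: the helper names are hypothetical, and the article does not describe how DiffusionGemma implements attention internally.

```python
def causal_mask(n):
    """Autoregressive view: position i sees only positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """Block view: every position sees every position in the block."""
    return [[True] * n for _ in range(n)]

def visible_pairs(mask):
    """Count how many (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)

if __name__ == "__main__":
    block = 256  # block size quoted in the article
    print(visible_pairs(causal_mask(block)))  # 32896
    print(visible_pairs(full_mask(block)))    # 65536
```

For a 256-token block, the full mask allows about twice as many token-to-token connections as the causal one, and the extra connections are exactly the ones that let an early token take later tokens into account.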
This makes it better suited for structured or rule-based tasks. For example, it can help fill in missing sections of code, complete structured formats like JSON, work through logic-heavy problems such as Sudoku-style puzzles, or handle mathematical patterns where consistency across the whole output matters more than sentence-by-sentence flow. Because it sees the entire block at once, it can also correct contradictions within the same generation cycle, rather than waiting for a later token to fix them.

