Inception Labs' Mercury 2 AI Beats Google's DiffusionGemma at Its Own Game

In brief

Inception Labs’ Mercury 2 generates roughly 1,000 tokens per second and scored 90% on AIME 2026.
Google’s recent DiffusionGemma hits similar speeds but performs worse on benchmarks.
DiffusionGemma is free and open-weight on Hugging Face. Mercury 2 is a paid, closed-weight API model.

Inception Labs introduced Mercury 2 on Thursday, calling it the world’s fastest reasoning language model. Per the company’s announcement, it generates about 1,000 tokens per second (tokens are the chunks of text an AI model reads and writes), against roughly 89 tokens per second for Anthropic’s Claude Haiku 4.5 Reasoning and 71 for OpenAI’s GPT-5 Mini.

That puts it in the same speed bracket Google would later claim for DiffusionGemma.
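To put the throughput figures quoted above in concrete terms, the short sketch below converts tokens per second into the time needed to generate one response. The 2,000-token response length is an assumption chosen purely for illustration, and the arithmetic ignores network latency and time to first token, so it is a back-of-the-envelope estimate rather than a benchmark.

```python
# Throughput figures as reported in Inception's announcement (tokens per second).
REPORTED_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}

# Assumption for illustration only: a 2,000-token response.
RESPONSE_TOKENS = 2000


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady rate, ignoring network overhead."""
    return num_tokens / tokens_per_second


def speedup_over(baseline_tps: float, fast_tps: float) -> float:
    """How many times faster fast_tps is than baseline_tps."""
    return fast_tps / baseline_tps


for name, tps in REPORTED_TPS.items():
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    ratio = speedup_over(tps, REPORTED_TPS["Mercury 2"])
    print(f"{name}: {seconds:.1f} s per response, Mercury 2 is {ratio:.1f}x as fast")
```

Under those assumptions, Mercury 2 would finish the response in about 2 seconds, against roughly 22 and 28 seconds for the other two models, which works out to about 11 and 14 times the throughput.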

Welcome to the diffusion era.

We bet on parallel generation years ago, when it was a contrarian idea. It’s great to see the industry arrive.

Mercury 2 continues to lead the Pareto frontier for quality, speed, and cost among publicly available diffusion LLMs.

— Inception (@_inception_ai) June 18, 2026

Both models get there by dropping the typewriter approach to writing. A standard chatbot writes one word, checks what it just wrote, then writes the next, looping until the answer is finished. Diffusion models instead fill a block of text with random placeholder tokens and erase the noise across a handful of parallel passes—the same trick that turns static into a photo in image generators like Stable Diffusion—until the whole block locks into a finished response at once.
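The difference in the number of model passes can be shown with a toy sketch. This is not Mercury 2’s or DiffusionGemma’s actual algorithm: there is no neural network here, the "model" simply knows the target sentence, and it starts from `<mask>` placeholders rather than random tokens. The names `autoregressive_decode`, `diffusion_decode`, and `num_passes` are hypothetical. The only point is to contrast one pass per token with a small, fixed number of passes over the whole block.

```python
import random

# Toy target text; a real model would predict these tokens instead of knowing them.
TARGET = "diffusion models refine a whole block of text in parallel".split()
MASK = "<mask>"


def autoregressive_decode(target):
    """Typewriter style: one model pass per token, left to right."""
    output, passes = [], 0
    for token in target:
        output.append(token)  # stand-in for "predict the next token"
        passes += 1
    return output, passes


def diffusion_decode(target, num_passes=3, seed=0):
    """Block style: start fully masked, fill in several positions per pass."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    # Hypothetical reveal order; a real model would rank positions by confidence.
    order = list(range(len(target)))
    rng.shuffle(order)
    per_pass = -(-len(target) // num_passes)  # ceiling division
    passes = 0
    for start in range(0, len(order), per_pass):
        for pos in order[start:start + per_pass]:
            block[pos] = target[pos]  # all positions in this pass resolve together
        passes += 1
        print(f"pass {passes}: {' '.join(block)}")
    return block, passes


ar_text, ar_passes = autoregressive_decode(TARGET)
diff_text, diff_passes = diffusion_decode(TARGET)
print(f"autoregressive passes: {ar_passes}, diffusion passes: {diff_passes}")
```

For this 10-word sentence the sequential loop needs 10 passes while the block version finishes in 3, and the gap widens as the block gets longer, since the pass count stays fixed.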

Where the two diverge is what survives that process. On AIME 2026—built from real American Invitational Mathematics Examination problems and scored as the percentage solved correctly—Mercury 2 hit 90%. Google tested DiffusionGemma on the same set, where it scored 69.1%, while standard, non-diffusion Gemma 4 scored 88.3% on the same test.

On GPQA, a PhD-level science benchmark scored the same way, the two models nearly tie: Mercury 2 at 77% against DiffusionGemma’s 73.2%. But Google’s own developer guide recommends standard Gemma 4 for applications that demand maximum quality, conceding DiffusionGemma trails it across the board.

The speed claim holds up outside the lab, too. Augment Code, an AI coding-agent company, swapped Mercury 2 in for Anthropic’s Claude Opus 4.7 on its context-compaction subagent and saw an 82% drop in latency and a 90% cut in cost, while reporting the same output quality, according to a joint case study.

Inception was built on research from its founder Stefano Ermon, a Stanford professor who co-authored some of the score-based diffusion techniques that power today’s image generators. The startup’s $50 million funding round drew backing from Nvidia’s venture arm and individual investors Andrew Ng and Andrej Karpathy.

For non-technical users, the biggest difference is one most people don’t notice until they feel it: the “flow.” Traditional models make you wait between responses in a long session. Diffusion models like this one let the AI keep pace with you, with instant autocomplete, rapid iterations on code or plans, and subagents that can handle the boring high-volume work without dragging the whole system down.

That subagent layer is the interesting architectural shift. Complex AI systems aren’t one giant smart model anymore. They’re orchestras of specialized helpers: one for deep reasoning, several for quick summarization, routing, tool lookup, output checking, etc. Sequential models make those utility calls expensive and slow. Parallel diffusion ones make them cheap and fast enough to use liberally.
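A rough sketch of why that matters for latency follows. Everything in it is hypothetical: the task kinds, the token counts, the `pick_tier` routing rule, and the 90 tokens-per-second figure assumed for a sequential model. Only the 1,000 tokens-per-second figure comes from Inception’s reported numbers. It compares running every call on the slower tier with routing the utility calls to the faster one.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str
    output_tokens: int


# Assumed throughput per tier, in tokens per second.
# "fast" uses the reported Mercury 2 figure; "deep" is an illustrative guess.
TIER_TPS = {"fast": 1000.0, "deep": 90.0}

# Hypothetical set of utility jobs that do not need the deepest reasoning.
UTILITY_KINDS = {"summarize", "route", "tool_lookup", "check_output"}


def pick_tier(task: Task) -> str:
    """Send utility work to the fast tier and everything else to the deep tier."""
    return "fast" if task.kind in UTILITY_KINDS else "deep"


def total_seconds(tasks, force_tier=None) -> float:
    """Sum generation time, run one call after another, ignoring network overhead."""
    return sum(t.output_tokens / TIER_TPS[force_tier or pick_tier(t)] for t in tasks)


# Made-up workload: one reasoning call plus four utility calls.
tasks = [
    Task("deep_reasoning", 900),
    Task("summarize", 500),
    Task("route", 50),
    Task("tool_lookup", 150),
    Task("check_output", 300),
]

print(f"all calls on the deep tier: {total_seconds(tasks, 'deep'):.1f} s")
print(f"utility calls routed fast:  {total_seconds(tasks):.1f} s")
```

With these made-up numbers, routing the four utility calls to the fast tier cuts total generation time from about 21 seconds to about 11, and what remains is almost entirely the single deep-reasoning call.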

Realistic caveats for regular users: These models are still best suited to the speed-sensitive, high-volume parts of a workflow rather than the absolute hardest frontier reasoning, where the biggest autoregressive (AR) models may still have an edge for now. Mercury 2 isn’t open-weight, so it is available only through a cloud API for now. And as with Google’s model, the full ecosystem (local runtimes, agent frameworks) is still catching up to make it seamless everywhere.

Use cases that stand out immediately: real-time programming and “vibe coding” where the model keeps up with your edits, multi-agent coding or support systems that make lots of fast sub-calls, voice interfaces that don’t feel laggy, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware add up fast.

The numbers Inception shares, along with the independent evals, make the case: Mercury 2 sits in the “fast and good” quadrant for diffusion models, bringing speeds that used to require exotic hardware down to commodity GPUs.