Google's Gemma 4 gains speed boost with Multi-Token Prediction drafters
Google rolled out Multi-Token Prediction (MTP) drafters for its Gemma 4 family of open models this spring, aiming to cut the latency of locally executed AI workloads. The new capability relies on speculative decoding: a small draft model proposes a handful of future tokens, and the main model then verifies them all in a single forward pass, keeping only the tokens it would have produced itself. Because that verification pass fills idle time the large model would otherwise spend waiting on memory, MTP can roughly double the tokens-per-second rate without sacrificing output quality.
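Conceptually, one draft-and-verify cycle looks like the sketch below. The two "models" are toy next-token functions standing in for real networks, and greedy acceptance is used for simplicity (sampled decoding uses a rejection-sampling variant); none of this is Google's actual implementation.

```python
# Toy sketch of greedy speculative decoding. The two "models" are stand-in
# next-token functions over integer token IDs, not real networks.

def toy_draft_model(ctx):
    # Hypothetical cheap drafter: guesses the next token is last + 1.
    return (ctx[-1] + 1) % 100

def toy_main_model(ctx):
    # Hypothetical large model: agrees with the drafter except after
    # tokens ending in 4, where it diverges.
    nxt = (ctx[-1] + 1) % 100
    return nxt if ctx[-1] % 5 != 4 else (nxt + 7) % 100

def speculative_step(main_model, draft_model, context, k=4):
    """One draft-and-verify cycle; returns the tokens to append."""
    # 1. The drafter cheaply proposes k future tokens, one at a time.
    ctx = list(context)
    draft = []
    for _ in range(k):
        tok = draft_model(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The main model checks every proposal. In a real engine this is
    #    a single batched forward pass costing about as much as one
    #    token, because decoding is dominated by memory traffic.
    accepted = []
    ctx = list(context)
    for proposed in draft:
        correct = main_model(ctx)
        if proposed != correct:
            accepted.append(correct)  # the main model's token wins; stop
            break
        accepted.append(proposed)
        ctx.append(proposed)
    return accepted

tokens = [0]
while len(tokens) < 20:
    tokens += speculative_step(toy_main_model, toy_draft_model, tokens)
print(tokens)  # identical to what the main model alone would produce
```

Because every kept token is one the main model would have chosen anyway, the output is unchanged; the gain comes entirely from how many tokens each expensive pass retires.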
Gemma 4 models share the underlying technology that powers Google's flagship Gemini system, but they are tuned for edge deployment. While Gemini runs on Google's custom Tensor Processing Units (TPUs) within massive data-center clusters, Gemma 4 can operate on a single high-end accelerator at full precision. Quantization further reduces the footprint, enabling the largest 26-billion-parameter model to run on consumer-grade GPUs such as the NVIDIA RTX PRO 6000.
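The memory arithmetic behind that claim is easy to check with a back-of-the-envelope sketch. The 26B parameter count comes from the article; the byte widths are standard for each precision, and the 1.2x headroom for KV cache and activations is an assumption:

```python
# Back-of-the-envelope VRAM footprint for a 26B-parameter model. The
# parameter count comes from the article; the 1.2x headroom for KV cache
# and activations is a rough assumption.

PARAMS = 26e9
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, width in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * width / 1e9
    total_gb = weights_gb * 1.2  # assumed headroom
    print(f"{precision}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total")
```

At 4-bit precision the weights shrink to roughly 13 GB, which is why a model of this size becomes plausible on a single workstation or gaming GPU.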
The move toward local AI reflects a growing demand for privacy‑preserving computation. By keeping data on‑device, developers avoid sending sensitive information to cloud services. Google’s decision to relicense Gemma 4 under Apache 2.0 reinforces that strategy, replacing a custom, more restrictive license used for earlier releases.
Typical consumer hardware, however, lacks the high-bandwidth memory (HBM) found in enterprise-grade machines. As a result, the processor spends a disproportionate share of each generation step shuffling model parameters between VRAM and the compute units, retiring only one token per full pass over the weights. MTP addresses this bottleneck by deploying a lightweight drafter (only 74 million parameters in the Gemma 4 E2B version) to generate speculative tokens during those memory-transfer cycles.
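A roofline-style estimate shows why this dominates on consumer cards: when each generated token requires streaming the full set of active weights once, memory bandwidth, not compute, caps throughput. The figures below are illustrative assumptions, not measurements from the article:

```python
# Roofline estimate for memory-bound decoding. Per-sequence throughput is
# capped by memory_bandwidth / bytes_streamed_per_token; both figures
# below are illustrative assumptions, not measurements.

bandwidth_bytes_s = 1.0e12   # assume a GPU with ~1 TB/s memory bandwidth
weight_bytes = 26e9 * 0.5    # 26B parameters quantized to int4 (~13 GB)

plain_tokens_s = bandwidth_bytes_s / weight_bytes
print(f"Plain decoding ceiling: ~{plain_tokens_s:.0f} tokens/s")

# If the drafter gets, say, 2 of its proposals accepted per verification
# pass, each full weight stream now retires 3 tokens instead of 1:
print(f"With MTP (3 tokens/pass): ~{3 * plain_tokens_s:.0f} tokens/s")
```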
The drafter shares the key‑value cache with the main model, eliminating the need to recompute context that the larger model has already established. Additionally, the E2B and E4B drafters employ a sparse decoding technique that narrows the search space to the most likely token clusters, further accelerating the process.
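Neither mechanism is described in detail, but both can be illustrated abstractly. The sketch below assumes a drafter that (a) reuses a hidden state derived from the shared cache rather than re-encoding the prompt, and (b) scores only a small shortlist of candidate tokens instead of the full vocabulary; every dimension and name here is hypothetical, and the sizes are scaled down to keep the sketch cheap to run:

```python
import numpy as np

# Hypothetical illustration of the drafter's two shortcuts; dimensions,
# shortlist construction, and mechanics are all assumptions, since the
# article does not describe the internals.

VOCAB = 32_000     # vocabulary size (assumed, scaled down for the sketch)
SHORTLIST = 512    # reduced candidate set the drafter actually scores
HIDDEN = 256       # hidden width (assumed)

rng = np.random.default_rng(0)
output_head = rng.standard_normal((VOCAB, HIDDEN), dtype=np.float32)

# (a) The hidden state is derived from the main model's shared KV cache,
#     so the drafter never re-encodes the prompt. Faked here as a vector.
hidden_state = rng.standard_normal(HIDDEN, dtype=np.float32)

# (b) Sparse decoding: score only a shortlist of likely token clusters
#     instead of all rows of the output head (~60x less work here).
shortlist_ids = rng.choice(VOCAB, size=SHORTLIST, replace=False)
logits = output_head[shortlist_ids] @ hidden_state
draft_token = shortlist_ids[int(np.argmax(logits))]
print(draft_token)
```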
Benchmarks on an NVIDIA RTX PRO 6000 show the MTP-enabled path roughly doubling the throughput of the standard inference path for the 26B Gemma 4 model while maintaining comparable output quality. In practical terms, users can expect the same answers in about half the time, a meaningful improvement for interactive applications such as chatbots, code assistants, and real-time translation tools.
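That observed doubling is consistent with the textbook speculative-decoding speedup model (an analysis convention, not Google's published methodology): with draft length k, an i.i.d. per-token acceptance probability a, and a drafter costing fraction c of a main-model token, the expected speedup is ((1 - a^(k+1)) / (1 - a)) / (1 + k*c):

```python
# Textbook speculative-decoding speedup estimate (not Google's numbers).
# k = draft length, a = per-token acceptance probability (assumed i.i.d.),
# c = drafter cost as a fraction of one main-model token.

def expected_speedup(a, k, c):
    tokens_per_pass = (1 - a ** (k + 1)) / (1 - a)
    return tokens_per_pass / (1 + k * c)

# A 74M drafter against a 26B target costs well under 1% per drafted
# token, so an acceptance rate near 55% already explains a ~2x speedup:
print(round(expected_speedup(a=0.55, k=4, c=0.003), 2))  # ~2.09
```

Higher acceptance rates or longer drafts push the ratio further, which is why drafter quality, not just drafter size, drives the gains.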
Google’s announcement positions Gemma 4 as a more viable option for developers who want the power of a large language model without committing to cloud‑based inference. By combining open licensing, hardware flexibility, and a speed‑enhancing speculative decoder, the company hopes to spur broader experimentation at the edge.