Future Token Prediction Accelerates Gemma

Turbocharged Local AI: Google Speeds Up Gemma 4 by 3x

Source: Google

Google Gemma 4 is getting a major performance boost through Multi-Token Prediction (MTP). The new approach allows local AI models to run up to three times faster without compromising output quality.

Google has released the latest open-source models in its Gemma 4 series and is now following up with an upgrade designed to significantly improve the performance of local artificial intelligence (edge AI). By introducing “Multi-Token Prediction” (MTP) drafters for Gemma, Google increases the generation speed of models running directly on end devices. According to official figures, the optimized models generate text up to three times faster than conventional token-by-token generation while maintaining the same output quality.

Multi-Token Prediction as an Efficiency Lever

Traditional large language models (LLMs) such as Gemma or Gemini generate text autoregressively: the system produces one token at a time, conditioned on everything generated so far. Each step requires the same amount of computational effort, regardless of whether the model is processing a complex logical argument or simply generating filler words.

This is where MTP comes into play. Instead of working strictly one token at a time, the MTP-enabled models use a form of speculative decoding. A lightweight “drafter model” predicts several likely future tokens in advance, and the primary model (target model), which handles the heavy computation, then checks them together. This approach improves utilization of compute cores, which in conventional workflows often sit idle while waiting for data transfers from memory.

Shared “Key Value Cache”

At the core of MTP is the collaboration between the main model and a much smaller auxiliary model. Within the Gemma 4 family, these drafter models contain as few as 74 million parameters, for example in the case of Gemma 4 E2B. Despite their compact size, they are highly optimized. One key advantage is the shared key-value (KV) cache, effectively the model's active working memory: the drafter does not need to recompute context that the primary model has already processed.

In addition, the E2B and E4B drafters use a technique called “Sparse Decoding” to narrow down clusters of likely tokens. The drafts generated by the auxiliary model are then verified by the actual Gemma model in a single compute pass. If the main model confirms the prediction, the entire sequence is accepted at once, and the same pass also yields one additional token from the main model itself. If the main model disagrees at some position, the draft is discarded from that point on and the main model's own token is used instead, as in conventional decoding. Because every accepted token is validated by the main model, output quality remains on par with standard inference.
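
The draft-then-verify cycle can be illustrated with a minimal Python sketch. It is not Google's implementation: `toy_target` and `toy_drafter` are hypothetical stand-in functions rather than Gemma models, decoding is greedy, and the verification loop only mimics the result of what a real system computes in a single batched forward pass. The sketch keeps the matching prefix of a partly wrong draft, which is how speculative decoding is commonly described.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the drafter: agrees except right after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Return the tokens emitted by one draft-then-verify step (greedy case)."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each draft position in order. A real system scores
    #    all positions in one forward pass; this loop only mimics the outcome.
    accepted: List[Token] = []
    for tok in draft:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target also yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken,
             prompt: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens after a non-empty prompt using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + n_tokens]


def greedy(target: NextToken, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Baseline: conventional one-token-at-a-time decoding."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out
```

With these stand-ins, `generate(toy_target, toy_drafter, [0], 12)` returns the same sequence as `greedy(toy_target, [0], 12)`, only in fewer verification steps. That equality is the point of the design: the drafter affects speed, not the result.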

Hardware Limits and Memory Bandwidth

Running local AI on conventional consumer hardware is often constrained less by raw compute power and more by memory bandwidth limitations. While enterprise hardware relies on High Bandwidth Memory (HBM), consumer PCs and mobile devices typically use standard system memory, which is considerably slower.

During inference, processors spend much of their time transferring parameters from memory (VRAM or system memory) to the compute units, and valuable compute cycles go unused during these transfers. MTP puts this idle capacity to work: the lightweight drafter model proposes speculative tokens, and the primary model verifies several of them in one pass over its weights instead of producing a single token per pass. This significantly improves overall efficiency and noticeably reduces latency for end users.
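
A rough back-of-the-envelope calculation shows why bandwidth matters so much. The sketch below assumes that every weight is read from memory once per generated token and ignores the KV cache, activations, and compute time, so it gives only an upper bound. The 31B parameter count comes from the model named in this article; the 0.5 bytes per parameter (4-bit quantization) and the 100 GB/s bandwidth are illustrative assumptions, not measured values.

```python
def max_decode_tokens_per_s(params_billion: float, bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    if params_billion <= 0 or bytes_per_param <= 0 or bandwidth_gb_s <= 0:
        raise ValueError("all arguments must be positive")
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return bandwidth_gb_s / weight_gb


# Illustrative assumptions only: 4-bit weights and 100 GB/s memory bandwidth.
bound = max_decode_tokens_per_s(params_billion=31, bytes_per_param=0.5,
                                bandwidth_gb_s=100)
print(f"bandwidth-bound ceiling: {bound:.1f} tokens/s")
```

Under this simple model, a pass that verifies several draft tokens reads the weights once but can emit more than one token, which is why speculative decoding helps most on bandwidth-limited hardware.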

Benchmarks: From Pixel to Apple Silicon

Google has published benchmark data for multiple hardware configurations. The results indicate that mobile devices and modern desktop chips benefit the most:

  • Google Pixel: The smaller E2B and E4B models run 2.8x and 3.1x faster, respectively, on Pixel smartphones.
  • Apple M4 Silicon: The significantly larger Gemma 4 31B model achieves a 2.5x speed increase on Apple chips.
  • NVIDIA RTX PRO 6000: Tests with the Gemma 4 26B model showed latency cut in half while maintaining the same output quality.
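
How such speedups depend on the drafter's accuracy can be estimated with a standard simplification from the speculative decoding literature. The sketch below assumes each draft token is accepted independently with the same probability `alpha` and ignores the drafter's own cost. Both `alpha` and the draft length `k` are illustrative parameters, not figures published by Google.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass: 1 + alpha + ... + alpha**k.

    alpha: assumed per-token acceptance probability (0.0 to 1.0)
    k:     number of tokens the drafter proposes per step
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
```

For example, with k = 4 draft tokens the estimate ranges from 1 token per pass when no draft is accepted to 5 when every draft is accepted, so a high acceptance rate is needed for multi-fold gains.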

Beyond raw speed improvements, the technology also boosts energy efficiency. Faster generation means processors remain under load for shorter periods, which positively impacts battery life on mobile devices. MTP also enables larger models such as the 31B dense model to run more smoothly on hardware that previously struggled with performance limits.

Google Changes Gemma 4 Licensing

Another important strategic move is Google’s licensing update. The company has placed the Gemma 4 models under the Apache 2.0 license. This is significantly more permissive than the previously used Gemma-specific licenses and gives developers greater flexibility to use and modify the models. The new MTP drafters are included under the same licensing terms.

Developers can already begin experimenting with speculative decoding today. The MTP-enabled versions are compatible with widely used frameworks, including:

  • MLX (optimized for Apple Silicon)
  • vLLM and SGLang (for server environments)
  • Ollama (for simple local deployment)

This broad ecosystem support lowers the barrier for building local AI applications capable of responding to user input in real time without relying on cloud connectivity.

Lisa Löw

Junior Editor

it-daily.net
