Gemma 4 QAT Models: Advancing Model Compression for Mobile Devices and Laptops

The latest additions to the Gemma 4 family leverage Quantization-Aware Training (QAT) to significantly lower memory requirements while maximizing performance on local devices.

Since the launch of Gemma 4 two months ago, we have continued expanding the model family’s capabilities. We first introduced Multi-Token Prediction (MTP) to speed up inference performance, and more recently released a 12B model designed to fill the gap between the E4B and 26B Mixture-of-Experts (MOE) variants.

Today, we are introducing a new set of checkpoints optimized through Quantization-Aware Training, making Gemma 4 even more efficient and enabling deployment on everyday edge devices and consumer-grade GPUs.

By simulating quantization throughout the training process, QAT helps minimize quality degradation that typically occurs when compressing a model. This release includes optimized checkpoints for the widely used Q4_0 quantization format, alongside a newly developed quantization approach specifically designed for mobile deployments. Using this mobile-focused format, we have reduced the memory footprint of Gemma 4 E2B to just 1 GB.

Together, these optimizations dramatically decrease memory requirements while maintaining the quality, performance, and capabilities users expect from the Gemma 4 family.

Preserving Model Quality While Reducing Size

Quantization remains one of the most important technologies for running large language models on consumer hardware. By reducing memory usage, quantization not only lowers storage requirements but also improves inference efficiency and decoding speed.

Traditional Post-Training Quantization (PTQ), however, often introduces some degree of performance loss. Rather than applying quantization only after training is complete, Quantization-Aware Training incorporates quantization effects directly into the training process itself.

While PTQ already performs well at preserving model quality, our QAT approach consistently delivers higher overall quality compared with conventional PTQ baselines.

We applied this QAT methodology across the popular Q4_0 format to maximize performance throughout the Gemma 4 lineup. For our edge-focused models, including E2B and E4B, we went further by redesigning the quantization approach around a specialized mobile-first quantization framework.

Reducing VRAM and Storage Requirements

The chart below outlines the approximate memory requirements needed to load various model configurations and provides an estimate of the VRAM necessary for deployment.

Approximate memory requirements indicating how much VRAM is required to load the models.

Mobile Optimization Under the Hood

Traditional model compression formats are often difficult for mobile processors to execute efficiently. To ensure smooth performance on smartphones and edge hardware, we developed a custom mobile-oriented quantization architecture specifically tailored for mobile accelerators.

Static Activations

In conventional systems, models frequently spend computational resources calculating scaling factors dynamically during inference.

To eliminate this overhead, we precompute these scaling parameters during training. This reduces the workload placed on mobile processors and contributes to faster response times.

Channel-Wise Quantization

We structured compressed model data to align closely with the architecture of modern mobile accelerators.

This allows devices to execute operations natively, avoiding slower compatibility layers and improving overall performance.

Targeted 2-Bit Quantization

We aggressively compressed the components responsible for token generation down to 2-bit precision while preserving higher precision within the model’s core reasoning layers.

This strategy delivers substantial storage savings without significantly affecting the model’s reasoning capabilities or output quality.

Embedding and KV Cache Optimization

Additional compression was applied to the model’s vocabulary embeddings and key-value cache, which serves as the model’s short-term memory.

These optimizations substantially reduce active memory usage and make it possible to maintain longer conversations without exhausting available resources.

Because audio and vision encoders are unnecessary for many deployment scenarios, users can further reduce memory consumption by deploying only the modalities required for their specific use case.

For example, the text-only version of Gemma 4 E2B, without Per-Layer Embeddings enabled, requires less than 1 GB of memory to operate.

Get Started Today

To ensure seamless adoption across a wide range of developer workflows, we have partnered with leading tools and platforms throughout the AI ecosystem to provide immediate support for Gemma 4 QAT checkpoints.

Download the Model Weights

The Q4_0 and mobile-optimized model weights are available today through Hugging Face.

Formats have been tailored to support different deployment environments:

GGUF models are ready for use with llama.cpp
Compressed tensor formats are available for vLLM
Unquantized checkpoints are also provided for developers who wish to convert and quantize models into alternative formats supporting Q4_0

Integrate and Learn

Comprehensive documentation is available to help developers understand deployment best practices and maximize performance when working with QAT checkpoints.

Run Models Locally on Desktop

Developers can easily download, manage, and execute Gemma 4 QAT models locally using popular desktop tools such as:

llama.cpp
Ollama
LM Studio

These platforms provide accessible interfaces for experimenting with and deploying Gemma models on personal hardware.

Deploy Directly on Devices

For optimized edge deployments, developers can use Google’s lightweight LiteRT-LM runtime.

Alternatively, models can be executed directly in the browser using Transformers.js, enabling web-based inference without dedicated server infrastructure.

Use Your Preferred Development Stack

Gemma 4 QAT models integrate with a broad range of development frameworks and serving platforms:

SGLang and vLLM for efficient serving of larger models
MLX for optimized performance on Apple Silicon devices
MTP QAT checkpoints to preserve Multi-Token Prediction acceleration while maintaining quantized efficiency
Hugging Face Transformers and Unsloth for direct fine-tuning and customization

These integrations make it easier than ever to incorporate Gemma 4 into existing workflows while benefiting from the reduced memory requirements and improved efficiency provided by Quantization-Aware Training.

With these new QAT releases, Gemma 4 continues to push the boundaries of efficient on-device AI, enabling powerful language models to run on laptops, smartphones, edge devices, and consumer GPUs without sacrificing quality or performance.