.##....##.########.##......##..######.....########..#######..########.....###....##....##
.###...##.##.......##..##..##.##....##.......##....##.....##.##.....##...##.##....##..##.
.####..##.##.......##..##..##.##.............##....##.....##.##.....##..##...##....####..
.##.##.##.######...##..##..##..######........##....##.....##.##.....##.##.....##....##...
.##..####.##.......##..##..##.......##.......##....##.....##.##.....##.#########....##...
.##...###.##.......##..##..##.##....##.......##....##.....##.##.....##.##.....##....##...
.##....##.########..###..###...######........##.....#######..########..##.....##....##...

24/7 Trending News.
Built for Humans & AI Agents.

Google DeepMind Unveils Gemma 4: A Multimodal Model for On-Device Use

Introduction to Gemma 4

Google DeepMind has released the Gemma 4 family of multimodal models, now available on Hugging Face. The models work with a range of tools and libraries, including transformers, llama.cpp, MLX, WebGPU, and Rust. The series is open source under the Apache 2.0 license and delivers strong performance across text, image, and audio modalities. Gemma 4 expands on previous iterations with enhanced capabilities for on-device deployment and a broader range of use cases.

Key Features of Gemma 4

Gemma 4 accepts image, text, and audio inputs and generates text responses. The text decoder is based on the Gemma model architecture, enabling long context windows. The image encoders have been improved to handle variable aspect ratios and configurable token counts, allowing developers to balance speed, memory, and quality. The smaller models (E2B and E4B) also support audio inputs. Four variants are available:
Gemma 4 E2B: 2.3B effective parameters (5.1B with embeddings), 128k context window
Gemma 4 E4B: 4.5B effective parameters (8B with embeddings), 128k context window
Gemma 4 31B: Dense model, 256k context window
Gemma 4 26B A4B: Mixture-of-experts model with 4B active parameters and 256k context window
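For on-device deployment, the effective parameter counts above translate directly into weight-memory budgets. The sketch below runs that arithmetic for the two smaller variants; the parameter counts come from the list above, while the bytes-per-parameter figures (2 for bf16, roughly 0.5 for int4) are generic assumptions, not official Gemma 4 numbers.

```python
# Back-of-envelope weight-memory estimates for the smaller Gemma 4 variants.
# Parameter counts are from the article; precision costs are assumptions
# (bf16 = 2 bytes/param, int4 ~ 0.5 bytes/param).

VARIANTS = {
    "E2B": 2.3e9,  # effective parameters
    "E4B": 4.5e9,
}

def weight_memory_gib(params, bytes_per_param):
    """Approximate weight memory in GiB for a given parameter count."""
    return params * bytes_per_param / 2**30

for name, params in VARIANTS.items():
    bf16 = weight_memory_gib(params, 2.0)
    int4 = weight_memory_gib(params, 0.5)
    print(f"{name}: ~{bf16:.1f} GiB (bf16), ~{int4:.1f} GiB (int4)")
```

By this rough estimate, the E2B variant needs about 4.3 GiB of weights in bf16 but only around 1.1 GiB at int4, which is why quantization matters so much for phones and other constrained devices.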

Architecture Innovations

The architecture combines components from prior Gemma models while omitting features whose benefits proved inconclusive or added too much complexity, such as AltUp. Key design elements include:
Alternating Attention Layers: Sliding-window attention layers interleaved with full-context attention layers, with window sizes of 512 tokens in the smaller models and 1024 tokens in the larger ones.
Dual RoPE Configurations: Standard RoPE for sliding layers, proportional RoPE for global layers to extend context length.
Per-Layer Embeddings (PLE): A secondary embedding pathway that provides token-specific signals per decoder layer, enhancing specialization without significant parameter costs.
Shared KV Cache: Reuses key-value states from earlier layers in the model’s final stages, reducing memory and computational overhead during inference.
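The alternation between sliding-window and full-context layers can be made concrete with a small mask builder. This is a minimal sketch of the general technique, not Gemma 4's actual implementation; the window size and causal convention are assumptions for illustration.

```python
def causal_mask(seq_len, window=None):
    """Build a causal attention mask: mask[i][j] is True when query token i
    may attend to key token j.

    window=None gives full causal attention (the "global" layers);
    a finite window gives sliding-window attention, where each token
    only sees the previous `window` positions, itself included.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                            # causal: no future tokens
            if window is not None:
                visible = visible and (i - j < window)  # local: bounded lookback
            row.append(visible)
        mask.append(row)
    return mask

# A 6-token example: a 3-token sliding window vs. full causal attention.
local = causal_mask(6, window=3)
full = causal_mask(6)
print(sum(sum(row) for row in local))  # local layers touch fewer key positions
print(sum(sum(row) for row in full))
```

The payoff is that local layers keep attention cost and KV-cache growth bounded by the window size regardless of sequence length, while the occasional global layer preserves long-range information flow.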

Support for Multimodal Tasks

Gemma 4 demonstrates strong performance across multimodal tasks without task-specific fine-tuning. Benchmarks indicate a text-only LMArena score of 1452 for the 31B dense model and 1441 for the 26B MoE variant (4B active parameters). Multimodal capabilities include object detection, speech-to-text, OCR, and function calling. For example, the model can generate JSON-formatted bounding boxes for GUI elements or reconstruct HTML pages from descriptions.
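The JSON bounding-box output mentioned above typically needs post-processing before it can drive a GUI agent. The article does not specify Gemma 4's output schema, so the sketch below assumes a hypothetical format with normalized [y0, x0, y1, x1] coordinates on a 0-1000 grid (a convention used by earlier Gemma-family detection prompts, but an assumption here).

```python
import json

# Hypothetical model output; the keys and coordinate order are assumptions
# for illustration, not the documented Gemma 4 schema.
model_output = """
[{"label": "search button", "box_2d": [120, 800, 180, 950]},
 {"label": "text field",    "box_2d": [110, 100, 190, 780]}]
"""

def to_pixels(box, width, height):
    """Convert a normalized [y0, x0, y1, x1] box (0-1000 grid) to pixel coords
    (x0, y0, x1, y1) for a screen of the given dimensions."""
    y0, x0, y1, x1 = box
    return (round(x0 / 1000 * width), round(y0 / 1000 * height),
            round(x1 / 1000 * width), round(y1 / 1000 * height))

detections = json.loads(model_output)
for d in detections:
    print(d["label"], to_pixels(d["box_2d"], width=1920, height=1080))
```

A production agent would also want to validate the JSON (models occasionally emit malformed output) before clicking anything.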

Deployment Flexibility

Gemma 4 is designed for broad deployment, supporting frameworks like transformers, llama.cpp, MLX, and WebGPU. Developers can integrate it with local agents, fine-tune using tools such as TRL on Vertex AI, or leverage Unsloth Studio for optimization. The models’ quantization-friendly design ensures compatibility with resource-constrained devices while maintaining performance.
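In the transformers ecosystem, multimodal models are usually driven through chat-style message lists. The sketch below shows that structure under the common transformers chat-template convention; the field names follow that convention but should be treated as assumptions rather than the official Gemma 4 API, and the file name is a placeholder.

```python
# Sketch of the message structure that transformers-style processors consume
# via apply_chat_template. Field names follow the common transformers
# convention; treat the exact schema as an assumption for Gemma 4.

def build_messages(image_path, question):
    """Assemble a single-turn multimodal prompt as a list of chat messages."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

messages = build_messages("screenshot.png", "List the clickable buttons as JSON.")
# In a real workflow these messages would be passed to
# processor.apply_chat_template(...) and then model.generate(...).
print(messages[0]["role"], len(messages[0]["content"]))
```

The same message list works across single-image, multi-image, and interleaved prompts, which is what makes the format convenient for local agents.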

Performance Benchmarks

Benchmarks highlight Gemma 4’s efficiency and effectiveness:
Long Context Support: The shared KV cache enables efficient processing of extended inputs, critical for on-device applications.
Multimodal Accuracy: Informal tests suggest accuracy on multimodal inputs comparable to the model's text-only performance, with robust handling of mixed modalities such as images and audio.
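The long-context benefit of the shared KV cache can be seen with back-of-envelope math. Every dimension in this sketch (layer count, KV heads, head size, the fraction of layers that share a cache) is hypothetical, chosen only to illustrate the scaling; none of these are published Gemma 4 figures.

```python
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_el=2):
    """KV cache size in GiB: keys + values for every caching layer,
    KV head, and sequence position, at the given element size (bf16 = 2)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el / 2**30

# Hypothetical 30-layer model at a 128k-token context with a bf16 cache:
baseline = kv_cache_gib(layers=30, kv_heads=8, head_dim=256, seq_len=128_000)

# If the final third of layers reuse KV states from earlier layers,
# only ~20 layers need their own cache:
shared = kv_cache_gib(layers=20, kv_heads=8, head_dim=256, seq_len=128_000)

print(f"baseline ~{baseline:.1f} GiB, with sharing ~{shared:.1f} GiB")
```

Because cache size grows linearly with sequence length, trimming the number of caching layers pays off most at exactly the long contexts on-device applications struggle with.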

Collaborative Development

The release reflects collaboration between Google DeepMind and the open-source community. Developers are encouraged to test the models, share feedback, and contribute to ongoing improvements. The availability on Hugging Face ensures accessibility for researchers and practitioners alike.

Written by

Max

Covers AI news, agentic AI, LLMs, and tech developments. When he is not writing, he is running open-source models just to see how they hold up.
