Notes on Google Gemma 4
Google DeepMind have published a new release of their open-weights model: Gemma 4. It’s available under the Apache License 2.0, an improvement over the more restrictive Gemma 3 license. Core features include multi-language support, reasoning, and multi-modality, including audio for some models. It comes in four different variants:
In the model descriptions, “E” stands for “effective parameters” (excluding the embeddings layer; the total size including the embeddings layer is given in parentheses above), and “A” stands for “active parameters”.
The edge-device models E2B and E4B feature a smaller vision encoder than the other variants (~150M vs. ~550M parameters).
Audio support
There are instructions for using the audio capabilities locally through HuggingFace Transformers, as well as separate instructions for Apple MLX.
Caveat: the maximum clip length supported by these models (E2B, E4B) is 30 seconds.
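To make the Transformers route concrete, here is a minimal sketch of local audio inference. The hub id and the pipeline task are my assumptions, not taken from the official instructions, and the message schema follows the standard Transformers multimodal chat-template convention; check the model card for the actual API.

# Minimal sketch: local audio inference via Hugging Face Transformers.
# Assumed: the hub id "google/gemma-4-e2b-it" (hypothetical) and the
# "image-text-to-text" pipeline task; verify both against the model card.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-e2b-it")
messages = [{
    "role": "user",
    "content": [
        # Keep clips at or under the 30-second limit noted above.
        {"type": "audio", "audio": "sample.wav"},
        {"type": "text", "text": "Transcribe this recording."},
    ],
}]
out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])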
Benchmarks
Simon Willison has run his usual Pelican comparison. The model card has numbers for benchmarks like MMLU Pro (comparable with Qwen3.5-35B-A3B) and Tau2 (lagging behind Qwen). Artificial Analysis have not yet published numbers.
Availability
The on-device models Gemma 4 E2B and E4B can be tried through the Google AI Edge Gallery app, which is available for Android and iOS. The data center models are currently listed as available through Google AI Studio and Novita (via HuggingFace Inference Providers or OpenRouter). Caveats:
with AI Studio: grounding with Google Search cannot be disabled
with OpenRouter: error: “No endpoints found that support tool use” (see the request sketch after this list)
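The obvious workaround for the OpenRouter caveat is to not pass tools at all. A minimal sketch against OpenRouter’s OpenAI-compatible API; the model slug is a guess, so check OpenRouter’s model list:

# Sketch: plain chat completion via OpenRouter's OpenAI-compatible endpoint.
# The model slug "google/gemma-4" is hypothetical; no tools are passed,
# since tool use currently fails with the error quoted above.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")
resp = client.chat.completions.create(
    model="google/gemma-4",  # hypothetical slug, check the model list
    messages=[{"role": "user", "content": "Summarize the Gemma 4 release."}],
)
print(resp.choices[0].message.content)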
XNNPACK crash with LiteRT-LM
Update 2026-04-12: when trying to use Gemma 4 Edge on real iOS devices, I hit this XNNPACK/GEMM crash in the vision encoder:
(lldb) bt
* thread #20, stop reason = EXC_BAD_ACCESS (code=1, address=0x12eb71c0)
* frame #0: 0x00000001031a0d64 liblitert_lm_c_api.dylib`xnn_f32_gemm_minmax_ukernel_6x8__asm_aarch64_neonfma_ld128 + 548
frame #1: 0x00000001030cf490 liblitert_lm_c_api.dylib`xnn_compute_gemm + 156
frame #2: 0x00000001031bc51c liblitert_lm_c_api.dylib`thread_parallelize_2d_tile_2d_dynamic + 448
frame #3: 0x00000001031c1d0c liblitert_lm_c_api.dylib`thread_main + 632
frame #4: 0x00000001fd3d444c libsystem_pthread.dylib`_pthread_start + 136

I took a detour trying to get an Apple MLX-based approach running, where the PhoneClaw project provided useful insights to the Codex agent working on my own codebase. However, some high-profile community snapshots appear broken because the novel Per-Layer Embeddings technique is converted incorrectly:
Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output due to quantizing PLE (Per-Layer Embedding) layers. This [FakeRockert543] repo provides working quantized weights. […] Gemma 4 uses a novel PLE (Per-Layer Embeddings) architecture with ScaledLinear layers that multiply outputs by a learned scalar. Standard quantization introduces rounding error in these layers, and the scalar amplifies it — producing ionoxffionoxff... garbage. Our fix: Only quantize the large decoder nn.Linear and SwitchLinear (MoE expert) layers. Everything else stays bf16.
(Emphasis mine).
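The fix described in the quote maps onto MLX’s selective-quantization hook. A minimal sketch, assuming an already-converted bf16 checkpoint; the repo id, the decoder weight-path prefix, and matching MoE experts by class name are my guesses, so inspect the checkpoint to confirm them:

# Selective quantization as described above: only decoder nn.Linear and
# MoE expert layers get quantized; PLE/ScaledLinear layers and the
# vision/audio towers stay bf16.
import mlx.nn as nn
from mlx_lm import load

# Load a bf16 (unquantized) conversion first; the repo id is hypothetical.
model, tokenizer = load("some-org/gemma-4-e2b-bf16")

def decoder_linears_only(path: str, module: nn.Module) -> bool:
    # Only descend into the decoder stack (prefix is an assumption), so
    # everything else is left in bf16.
    if not path.startswith("language_model.layers"):
        return False
    # Quantize plain Linear layers plus MoE experts; the experts are
    # matched by class name so the sketch needs no extra imports.
    return isinstance(module, nn.Linear) or type(module).__name__ == "SwitchLinear"

# nn.quantize swaps matching layers for quantized equivalents in place.
nn.quantize(model, group_size=64, bits=4, class_predicate=decoder_linears_only)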
Also, the Osaurus MLX quant of Gemma 4 E2B notes:
Some mlx-community conversions of Gemma 4 have broken or zeroed-out vision/audio tower weights, producing models that appear functional for text but silently fail on image and audio inputs.
Even these proved inadequate for running on an iPhone (6 GB RAM) or iPad (8 GB RAM), either because of size (particularly E4B) or because of (vision) response quality.
Back to LiteRT-LM, the remedy for the XNNPACK crash was twofold:
use the C API call litert_lm_engine_settings_set_cache_dir() to set the cache directory to a writable location (NSCachesDirectory), while keeping the model files in a read-only folder. The runtime then resolves the vision cache directory not to the writable location but to the read-only one, yielding a runtime warning but not a crash(?!)
configure the engine to use “cpu” for both text and vision: litert_lm_engine_settings_create("cpu", "cpu", nullptr)
This allows at least the E2B variant to run on Apple devices with 6 GB RAM.

