KV Cache Resumption and Context-Length Scaling on Apple Silicon¶
Overview¶
This page documents empirical findings from benchmarking Gemma 4 26B MoE (4-bit) on Apple Silicon using mlx-vlm 0.4.4. Two related questions are answered:
- Context scaling: how do prefill throughput and TTFT degrade as context length grows?
- KV cache resumption: how much does reusing a prefilled KV cache speed up subsequent turns?
All benchmarks were run on 2026-04-04. The server setup, benchmark scripts, and launchd service are in tomsalphaclawbot/gemma4-local.
Test Hardware¶
| Spec | Detail |
|---|---|
| Machine | MacBook Air (Mac16,13) |
| Chip | Apple M4 |
| CPU | 10 cores (4 performance + 6 efficiency) |
| GPU | 10-core Apple M4, Metal 3 |
| Unified Memory | 32 GB (~100 GB/s bandwidth) |
| Storage | 1 TB Apple SSD (AP1024Z) |
| OS | macOS Sequoia 15.7.4 (24G517) |
| mlx-vlm | 0.4.4 (with chunked prefill fix, PR #858) |
| mlx-lm | 0.31.1 |
Context-Length Scaling¶
Setup¶
- Model: `mlx-community/gemma-4-26b-a4b-it-4bit`
- Prompts: synthetic filler text (~80 tokens/repetition), followed by a short question
Results¶
| Context target | Actual tokens | Prefill (tok/s) | Gen (tok/s) | TTFT | Peak RAM |
|---|---|---|---|---|---|
| 1K | 1,081 | 283 | 33 | 3.8s | 16.9 GB |
| 4K | 4,387 | 325 | 33 | 13.5s | 17.3 GB |
| 8K | 8,737 | 312 | 31 | 28.0s | 17.6 GB |
| 16K | 17,437 | 289 | 27 | 60.4s | 18.4 GB |
| 32K | OOM at ~35% through prefill | — | — | — | — |
Key observations¶
Prefill throughput peaks near 4K context (325 tok/s) and degrades gradually from there: 312 tok/s at 8K, 289 tok/s at 16K, with individual chunks near the end of the 16K sequence dropping to ~270 tok/s. (The 1K run is slower, 283 tok/s, likely because fixed startup overhead is amortized over fewer tokens.) The degradation is gradual rather than a cliff because mlx-vlm 0.4.4 uses chunked prefill (2048 tokens/chunk), so each chunk processes sequentially rather than in one massive matmul.
Generation speed stays in a narrow band (~27-33 tok/s) regardless of context length, declining only mildly as the KV cache grows, until OOM. Decode is memory-bandwidth bound and largely independent of KV cache size in this range.
RAM grows linearly with context. Model weights occupy ~16 GB. Each 1K tokens of context adds roughly 130-140 MB of KV cache. At 16K context, peak RAM is 18.4 GB, leaving ~13 GB of headroom on the 32 GB machine.
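The linear growth is easy to sanity-check against the table. A back-of-envelope sketch, using the midpoint of the observed 130-140 MB/1K range (observed peaks run slightly higher at short contexts, where runtime overhead beyond the weights dominates):

```python
# Rough peak-RAM model for the 4-bit 26B model on this machine:
# weights (~16 GB) + KV cache (~0.135 GB per 1K tokens of context).
WEIGHTS_GB = 16.0
KV_GB_PER_1K_TOKENS = 0.135  # midpoint of the observed 130-140 MB per 1K

def peak_ram_gb(context_tokens: int) -> float:
    """Predicted peak resident memory for a given context length."""
    return WEIGHTS_GB + (context_tokens / 1000) * KV_GB_PER_1K_TOKENS

for tokens in (1_081, 4_387, 8_737, 17_437):
    print(f"{tokens:>6} tokens -> ~{peak_ram_gb(tokens):.1f} GB predicted")
```

At 17,437 tokens the model predicts ~18.4 GB, matching the measured peak for the 16K run.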
Hard wall at ~20-22K tokens. The 32K benchmark OOM'd mid-prefill. At ~12K tokens into the 32K prefill, memory pressure spiked and the OS killed the process. The KV cache alone for 32K tokens exceeds available headroom.
KV Cache Resumption¶
The mechanism¶
mlx-vlm 0.4.4 exposes `PromptCacheState` (in `mlx_vlm.generate`) and a `prompt_cache_state=` kwarg to `stream_generate()`. The workflow:
- Turn 1 (prime): run inference on the base context. The KV cache for every layer is stored in the `PromptCacheState`.
- Turn 2+ (resume): pass the same `PromptCacheState` object. The library computes the longest common prefix between the cached token IDs and the new prompt, trims the cache to that prefix, and prefills only the new tokens.
- Result: TTFT for turn 2 is proportional to the delta (new tokens only), not the full context.
```python
from mlx_vlm import load, stream_generate
from mlx_vlm.generate import PromptCacheState

model, processor = load("mlx-community/gemma-4-26b-a4b-it-4bit")
base_prompt = "..."             # long base context (e.g. a system prompt)
base_plus_delta_prompt = "..."  # the same base plus the new user turn

cache_state = PromptCacheState()

# Turn 1: pay the full prefill cost; the cache is populated as a side effect
for chunk in stream_generate(model, processor, base_prompt,
                             max_tokens=40, prompt_cache_state=cache_state):
    ...

# Turn 2: only the tokens past the shared prefix are prefilled
for chunk in stream_generate(model, processor, base_plus_delta_prompt,
                             max_tokens=40, prompt_cache_state=cache_state):
    ...
```
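The prefix-matching step can be modeled in plain Python. This is a simplified illustration of the behavior described above, not mlx-vlm's actual implementation:

```python
def common_prefix_len(cached_ids: list[int], new_ids: list[int]) -> int:
    """Length of the longest shared prefix between two token-ID sequences."""
    n = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(cached_ids: list[int], new_ids: list[int]) -> list[int]:
    """After trimming the cache to the shared prefix, only the remainder
    of the new prompt needs to be prefilled."""
    keep = common_prefix_len(cached_ids, new_ids)
    return new_ids[keep:]

base = list(range(4000))        # stand-in for a cached ~4K-token prompt
extended = base + [9001, 9002]  # same prompt plus a 2-token delta
print(len(tokens_to_prefill(base, extended)))  # only the delta is prefilled
```

Note that if the new prompt diverges mid-sequence (e.g. an edited earlier turn), everything after the divergence point must be re-prefilled, which is why the speedup depends on a long shared prefix.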
Results (Gemma 4 26B MoE)¶
| Base ctx | Delta | Cold TTFT | Warm TTFT | Speedup |
|---|---|---|---|---|
| 2K | +200 tok | 6.9s | 0.8s | 8.2× |
| 2K | +500 tok | 7.5s | 1.6s | 4.6× |
| 2K | +1K tok | 8.9s | 3.1s | 2.8× |
| 4K | +200 tok | 14.0s | 0.9s | 15.8× |
Speedup scales with the base-to-delta ratio. A 200-token delta on a 4K base skips ~95% of the prefill work: cold TTFT is 14.0s, warm is 0.9s. The warm TTFT approaches the cost of prefilling only the delta (~200 tokens at ~300 tok/s ≈ 0.7s) plus scheduling overhead.
The warm run reports extremely high nominal throughput (2722-4966 tok/s) because the rate is computed over the full prompt token count, even though almost none of it was actually processed.
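The table falls out of a simple cost model. A sketch, assuming a flat ~300 tok/s prefill rate (per the context-scaling table) and a fixed per-request overhead; the 0.3 s overhead figure is an assumption for illustration, not a measurement:

```python
PREFILL_TOK_S = 300.0  # assumed flat prefill rate, from the scaling table
OVERHEAD_S = 0.3       # assumed fixed per-request scheduling overhead

def cold_ttft(base_tok: int, delta_tok: int) -> float:
    """Cold start prefills the entire prompt."""
    return (base_tok + delta_tok) / PREFILL_TOK_S + OVERHEAD_S

def warm_ttft(base_tok: int, delta_tok: int) -> float:
    """Resumption skips the base entirely; only the delta is prefilled."""
    return delta_tok / PREFILL_TOK_S + OVERHEAD_S

for base, delta in [(2000, 200), (2000, 500), (2000, 1000), (4000, 200)]:
    c, w = cold_ttft(base, delta), warm_ttft(base, delta)
    print(f"base={base} delta={delta}: cold~{c:.1f}s warm~{w:.1f}s "
          f"speedup~{c / w:.1f}x")
```

This reproduces the measured table within roughly 10%: the 4K-base, 200-token-delta case predicts ~14.3 s cold and ~1.0 s warm versus the observed 14.0 s and 0.9 s.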
Important limitations¶
In-process only. The KV cache lives in GPU (unified) memory. There is no serialization to disk. If the model process restarts, you pay the full prefill cost again. The HTTP server (mlx_vlm.server) does not persist KV state across requests — each request is currently a cold start.
RAM doubles during the transition. While priming the cache and then running the warm generation, both the old KV state and the new prefill are simultaneously in memory. This effectively halves the usable context window for cache resumption. On 32 GB with the 26B model:
- Safe resumed context: ~8K tokens
- Theoretical max: ~20K, but holding cache + new prefill simultaneously pushes OOM
Common pitfall: leaking caches across iterations. A benchmark or application that creates PromptCacheState once and reuses it across unrelated conversations will accumulate stale GPU allocations. Always instantiate a fresh PromptCacheState() per conversation, and call gc.collect() when discarding a cache.
Practical Design Patterns¶
Pattern 1: Long system prompt + short user queries¶
Ideal use case. Prefill the system prompt once and cache it. Every subsequent user query only pays for prefill of the user message (~50-200 tokens typically).
First turn: [16K system prompt + first user message] → 60s TTFT
Turn 2+: [16K system prompt + new user message] → ~1s TTFT (only ~100 tokens prefilled)
Pattern 2: Document Q&A session¶
Load a large document (8-16K tokens) and ask multiple questions about it. Cache the document's KV state on turn 1, then answer subsequent questions cheaply.
Pattern 3: Agent memory injection¶
Inject a memory summary as a long prefix at session start. Cache it. Agent reasoning loop only prefills new observations and tool results.
Anti-pattern: Per-request server with KV cache¶
The standard HTTP server does not maintain session state. Using PromptCacheState across HTTP requests would require maintaining a session store mapping request IDs to in-memory KV caches — a stateful server design that currently doesn't exist in mlx-vlm. Until that's built, cache resumption only works in single-process chat loops.
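If such a stateful server were built, its core would be a store mapping session IDs to live caches, with eviction, since every retained cache pins unified (GPU) memory. A hypothetical sketch; the `SessionStore` name and LRU policy are inventions for illustration, and a plain `dict` stands in for `PromptCacheState`:

```python
from collections import OrderedDict
from typing import Callable

class SessionStore:
    """Maps session IDs to in-memory KV cache objects, evicting the
    least-recently-used session at capacity. Eviction matters because
    every live cache pins unified (GPU) memory."""

    def __init__(self, max_sessions: int = 2):
        self.max_sessions = max_sessions
        self._caches: OrderedDict[str, object] = OrderedDict()

    def get_or_create(self, session_id: str, factory: Callable[[], object]):
        if session_id in self._caches:
            self._caches.move_to_end(session_id)  # mark as recently used
        else:
            if len(self._caches) >= self.max_sessions:
                self._caches.popitem(last=False)  # drop the LRU cache
            self._caches[session_id] = factory()
        return self._caches[session_id]

store = SessionStore(max_sessions=2)
store.get_or_create("conv-a", dict)  # dict stands in for PromptCacheState
store.get_or_create("conv-b", dict)
store.get_or_create("conv-c", dict)  # evicts conv-a, the least recently used
print(sorted(store._caches))
```

With the ~8K safe-resumed-context figure above, a realistic `max_sessions` on this machine would be very small, which is part of why this design has not been built.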
System Stability: Serialization and Memory Safety¶
The OOM footgun¶
The MLX server is single-threaded for inference but accepts concurrent HTTP connections, and it has no built-in memory safeguards. When multiple requests arrive simultaneously, the server allocates KV cache memory for each in parallel without backpressure. On a 32 GB machine with a 15.7 GB model, 2-3 concurrent requests can spike memory past available limits, causing macOS Jetsam to kill other processes.
Confirmed empirically (2026-04-04): 5 concurrent requests to the raw MLX server caused SIGKILL on all processes and took down the OpenClaw agent gateway.
Serializing proxy (the fix)¶
A lightweight HTTP proxy (gemma4-proxy.py) sits between all clients and the MLX server, ensuring only one inference request reaches MLX at a time:
Clients → :8891 proxy (serializing queue + memory gate) → :8890 MLX server
Features: FIFO request queue, vm_stat memory pressure check before forwarding, 120s queue timeout with clean 503, stats/health endpoints. Zero external dependencies.
After deploying the proxy, 3 concurrent requests queued and completed with zero failures and 7.4 GB free RAM — the exact scenario that previously crashed the system.
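The serializing idea reduces to a one-at-a-time gate with a bounded wait. A minimal pure-Python sketch of that mechanism, not the actual gemma4-proxy.py code (the real proxy uses an explicit FIFO queue; `threading.Lock` does not guarantee FIFO wakeup order):

```python
import threading

class SerialGate:
    """Admits one request at a time; a waiter that exceeds the timeout
    is refused (the proxy maps that refusal to a clean HTTP 503)."""

    def __init__(self, timeout_s: float = 120.0):
        self._lock = threading.Lock()
        self._timeout_s = timeout_s

    def run(self, fn):
        if not self._lock.acquire(timeout=self._timeout_s):
            return None  # caller translates this into a 503 response
        try:
            return fn()
        finally:
            self._lock.release()

gate = SerialGate(timeout_s=120.0)
results = []
threads = [threading.Thread(target=lambda i=i: results.append(gate.run(lambda: i)))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # all three requests complete, one at a time
```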
See Local MLX Inference Patterns for full proxy documentation.
Docker coexistence¶
Running MLX alongside Docker Desktop is dangerous on 32 GB unified memory without a wired memory cap. The failure chain:
1. MLX fills available unified memory (model weights ~16 GB + KV caches)
2. macOS swaps aggressively to disk
3. Docker's VM process does heavy paged writes simultaneously
4. System hits macOS's disk write throttle → hard reset
Diagnostic source: /Library/Logs/DiagnosticReports/ResetCounter-*.diag.
Defense in depth (current production config)¶
| Layer | Setting | Purpose |
|---|---|---|
| Serializing proxy | `gemma4-proxy.py` on :8891 | Prevents concurrent inference requests |
| Wired memory cap | `iogpu.wired_limit_mb=16384` | Caps MLX GPU memory at 16 GB |
| Context cap | `--max-kv-size 16384` | Limits KV cache to 16K tokens |
| Memory gate | `GEMMA_MIN_FREE_RAM_GB=4.0` | Proxy rejects if <4 GB free |
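The memory gate needs a free-RAM number, which on macOS comes from `vm_stat`. A sketch of the parsing step, run here against a sample of typical `vm_stat` output (the field names and header format are standard; the sample page counts are made up, and a real gate would feed in `subprocess.check_output(["vm_stat"], text=True)`):

```python
import re

def free_gb_from_vm_stat(output: str) -> float:
    """Free + speculative pages, converted to GB using the page size
    reported in the vm_stat header line."""
    page_size = int(re.search(r"page size of (\d+) bytes", output).group(1))
    pages = dict(re.findall(r"^(Pages [a-z ]+):\s+(\d+)\.", output, re.M))
    free = int(pages["Pages free"]) + int(pages.get("Pages speculative", 0))
    return free * page_size / 1024**3

sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                              262144.
Pages active:                            500000.
Pages speculative:                        65536.
"""
print(round(free_gb_from_vm_stat(sample), 1))  # sample works out to 5.0 GB
```

The gate then simply compares this value against `GEMMA_MIN_FREE_RAM_GB` before forwarding a request.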
```shell
# One-time setup (requires sudo)
sudo sysctl iogpu.wired_limit_mb=16384
echo 'iogpu.wired_limit_mb=16384' | sudo tee -a /etc/sysctl.conf
```
This leaves 16 GB for Docker, OS, agent gateway, and other processes. See gemma4-local for the full server + proxy setup.
TurboQuant KV Compression¶
Separately from resumption, mlx-vlm 0.4.4 includes TurboQuant (PR #858) — Walsh-Hadamard rotation + Lloyd-Max quantization of the KV cache itself:
- `turbo3` (3-bit K, V): 4.6× KV memory reduction, +1.06% perplexity
- `turbo4` (4-bit K, V): 3.8× reduction, +0.23% perplexity
- V compression is nearly free (zero quality impact at 2-bit); all quality loss comes from K compression
With TurboQuant KV-3, a 16K context KV cache that would normally take ~2.3 GB takes ~500 MB — roughly tripling the effective context budget before hitting the RAM wall.
Enable via `--kv-bits 3 --kv-quant-scheme turboquant` on the server.
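The memory arithmetic behind the ~500 MB figure, as a quick sketch using the numbers above:

```python
KV_GB_16K_UNCOMPRESSED = 2.3  # ~16K-context KV cache without TurboQuant
REDUCTION = {"turbo3": 4.6, "turbo4": 3.8}  # factors from the list above

for scheme, factor in REDUCTION.items():
    compressed_gb = KV_GB_16K_UNCOMPRESSED / factor
    print(f"{scheme}: 16K-context KV cache shrinks to ~{compressed_gb * 1024:.0f} MB")
```

turbo3 takes the 2.3 GB cache to exactly 0.5 GB, which is where the "~500 MB" and "roughly tripling the effective context budget" figures come from.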
External References¶
- @tomchapin on X (2026-04-04) — shared these findings with commentary: "Using more than 16k worth of context caused the whole system to crash"
Related¶
- Local MLX Inference Patterns
- Prefix Cache Behavior in Hybrid Attention Models
- gemma4-local — scripts, server, launchd service, and benchmark code from this article: github.com/tomsalphaclawbot/gemma4-local