KV Cache Resumption and Context-Length Scaling on Apple Silicon

Overview

This page documents empirical findings from benchmarking Gemma 4 26B MoE (4-bit) on Apple Silicon using mlx-vlm 0.4.4. Two related questions are answered:

  1. Context scaling: how do prefill throughput and TTFT degrade as context length grows?
  2. KV cache resumption: how much does reusing a prefilled KV cache speed up subsequent turns?

All benchmarks were run on 2026-04-04. The server setup, benchmark scripts, and launchd service are in tomsalphaclawbot/gemma4-local.


Test Hardware

Spec Detail
Machine MacBook Air (Mac16,13)
Chip Apple M4
CPU 10 cores (4 performance + 6 efficiency)
GPU 10-core Apple M4, Metal 3
Unified Memory 32 GB (~100 GB/s bandwidth)
Storage 1 TB Apple SSD (AP1024Z)
OS macOS Sequoia 15.7.4 (24G517)
mlx-vlm 0.4.4 (with chunked prefill fix, PR #858)
mlx-lm 0.31.1

Context-Length Scaling

Setup

  • Model: mlx-community/gemma-4-26b-a4b-it-4bit
  • Prompts: synthetic filler text (~80 tokens/repetition), followed by a short question
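
A minimal sketch of how such a prompt can be assembled (the filler sentence and exact wording are illustrative, not the benchmark's actual text; only the ~80 tokens/repetition shape is from the measurements above):

```python
# Sketch: build a synthetic long-context prompt from repeated filler,
# targeting an approximate token count (~80 tokens per repetition).
FILLER = "The quick brown fox jumps over the lazy dog. " * 8  # roughly 80 tokens

def build_prompt(target_tokens: int, tokens_per_rep: int = 80) -> str:
    reps = max(1, target_tokens // tokens_per_rep)
    body = FILLER * reps
    question = "In one sentence, what is this text about?"
    return body + "\n\n" + question

prompt_16k = build_prompt(16_000)  # ~16K tokens of filler plus a short question
```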

Results

Context target Actual tokens Prefill (tok/s) Gen (tok/s) TTFT Peak RAM
1K 1,081 283 33 3.8s 16.9 GB
4K 4,387 325 33 13.5s 17.3 GB
8K 8,737 312 31 28.0s 17.6 GB
16K 17,437 289 27 60.4s 18.4 GB
32K OOM at ~35% through prefill

Key observations

Prefill throughput degrades gradually with context length. It peaks at 325 tok/s on the 4K run and falls to a 289 tok/s average at 16K, with individual chunks near the end of the sequence dropping to ~270 tok/s (the 1K run's 283 tok/s is lower, likely because fixed startup cost dominates such a short prefill). The degradation is gradual, not a cliff, because mlx-vlm 0.4.4 uses chunked prefill (2048 tokens/chunk), so each chunk processes sequentially rather than in one massive matmul.

Generation speed is stable (~27-33 tok/s) regardless of context length, until OOM. Decode is memory-bandwidth bound and largely independent of KV cache size in this range.

RAM grows linearly with context. Model weights occupy ~16 GB. Each 1K tokens adds roughly 130-140 MB of KV cache. At 16K context, peak RAM is 18.4 GB, leaving ~13 GB of headroom on the 32 GB machine.

Hard wall at ~20-22K tokens. The 32K benchmark OOM'd mid-prefill. At ~12K tokens into the 32K prefill, memory pressure spiked and the OS killed the process. The KV cache alone for 32K tokens exceeds available headroom.
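
The linear growth above implies a simple memory model; a back-of-envelope sketch (135 MB/1K tokens is the midpoint of the measured range, and the ~0.2-0.8 GB gap vs. measured peaks is runtime/activation overhead not modeled here):

```python
# Back-of-envelope RAM model from the measurements above:
# fixed weights plus linear KV-cache growth.
WEIGHTS_GB = 16.0
KV_GB_PER_1K = 0.135  # midpoint of the measured 130-140 MB per 1K tokens

def peak_ram_gb(context_tokens: int) -> float:
    """Predicted peak RAM for a given context length."""
    return WEIGHTS_GB + KV_GB_PER_1K * context_tokens / 1000

for ctx in (1_000, 4_000, 8_000, 16_000):
    print(ctx, round(peak_ram_gb(ctx), 1))  # tracks the measured column within ~1 GB
```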


KV Cache Resumption

The mechanism

mlx-vlm 0.4.4 exposes PromptCacheState (in mlx_vlm.generate) and a prompt_cache_state= kwarg to stream_generate(). The workflow:

  1. Turn 1 (prime): run inference on the base context. The KV cache for every layer is stored in PromptCacheState.
  2. Turn 2+ (resume): pass the same PromptCacheState object. The library computes the longest common prefix between the cached token IDs and the new prompt, trims the cache to that prefix, and only prefills the new tokens.
  3. Result: TTFT for turn 2 is proportional to the delta (new tokens only), not the full context.

from mlx_vlm import stream_generate
from mlx_vlm.generate import PromptCacheState

cache_state = PromptCacheState()

# Turn 1: pay full prefill cost, cache is populated
for chunk in stream_generate(model, processor, base_prompt,
                              max_tokens=40, prompt_cache_state=cache_state):
    ...

# Turn 2: only new tokens are prefilled
for chunk in stream_generate(model, processor, base_plus_delta_prompt,
                              max_tokens=40, prompt_cache_state=cache_state):
    ...
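
Step 2's prefix matching can be illustrated with plain token-ID lists; this is a conceptual sketch of the behavior, not mlx-vlm's actual implementation:

```python
def common_prefix_len(cached_ids: list[int], new_ids: list[int]) -> int:
    """Length of the longest shared prefix between cached and new prompts."""
    n = 0
    for a, b in zip(cached_ids, new_ids):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(cached_ids: list[int], new_ids: list[int]) -> list[int]:
    """On resume, only the suffix beyond the shared prefix needs prefilling."""
    keep = common_prefix_len(cached_ids, new_ids)
    return new_ids[keep:]

cached = [1, 2, 3, 4, 5]        # turn-1 prompt
new = [1, 2, 3, 4, 5, 9, 9]     # same base plus 2 new tokens
print(tokens_to_prefill(cached, new))  # [9, 9] — only the delta is prefilled
```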

Results (Gemma 4 26B MoE)

Base ctx Delta Cold TTFT Warm TTFT Speedup
2K +200 tok 6.9s 0.8s 8.2×
2K +500 tok 7.5s 1.6s 4.6×
2K +1K tok 8.9s 3.1s 2.8×
4K +200 tok 14.0s 0.9s 15.8×

Speedup scales with the base-to-delta ratio. A 200-token delta on a 4K base skips ~95% of the prefill work: cold TTFT is 14.0s, warm is 0.9s. The warm TTFT is close to the cost of prefilling just the 200 delta tokens (~0.7s at ~300 tok/s) plus scheduling overhead.

The warm run reports extremely high nominal throughput (2722–4966 tok/s) because the denominator is the full prompt token count but almost none of it was actually processed.
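
A sketch of that arithmetic, with illustrative numbers matching the 4K + 200 row:

```python
# Nominal prefill throughput divides the FULL prompt length by warm TTFT,
# even though only the delta was actually processed (illustrative numbers).
base_tokens, delta_tokens = 4_000, 200
warm_ttft_s = 0.9

nominal = (base_tokens + delta_tokens) / warm_ttft_s   # denominator artifact
effective = delta_tokens / warm_ttft_s                 # tokens actually prefilled
print(round(nominal), round(effective))                # ~4667 vs ~222 tok/s
```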

Important limitations

In-process only. The KV cache lives in GPU (unified) memory. There is no serialization to disk. If the model process restarts, you pay the full prefill cost again. The HTTP server (mlx_vlm.server) does not persist KV state across requests — each request is currently a cold start.

RAM doubles during the transition. While priming the cache and then running the warm generation, both the old KV state and the new prefill are simultaneously in memory. This effectively halves the usable context window for cache resumption. On 32 GB with the 26B model:

  • Safe resumed context: ~8K tokens
  • Theoretical max: ~20K, but holding cache + new prefill simultaneously pushes OOM

Common pitfall: leaking caches across iterations. A benchmark or application that creates PromptCacheState once and reuses it across unrelated conversations will accumulate stale GPU allocations. Always instantiate a fresh PromptCacheState() per conversation, and call gc.collect() when discarding a cache.
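
The safe lifecycle looks like this in outline; StubCacheState below is a stand-in for the real PromptCacheState, which pins GPU memory:

```python
import gc

class StubCacheState:
    """Stand-in for PromptCacheState; the real object holds GPU allocations."""
    def __init__(self):
        self.kv = []

def run_conversation(turns: list[str]) -> int:
    cache = StubCacheState()         # fresh cache per conversation, never shared
    try:
        for prompt in turns:
            cache.kv.append(prompt)  # stands in for prefill + generation
        return len(cache.kv)
    finally:
        del cache                    # drop the reference when the conversation ends
        gc.collect()                 # release the discarded cache's allocations

print(run_conversation(["hello", "follow-up"]))  # 2
```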


Practical Design Patterns

Pattern 1: Long system prompt + short user queries

Ideal use case. Prefill the system prompt once and cache it. Every subsequent user query only pays for prefill of the user message (~50-200 tokens typically).

First turn:  [16K system prompt + first user message]  → 60s TTFT
Turn 2+:     [16K system prompt + new user message]    → ~1s TTFT (only ~100 tokens prefilled)

Pattern 2: Document Q&A session

Load a large document (8-16K tokens), ask multiple questions about it. Cache the document's KV state on turn 1, then answer subsequent questions cheaply.

Pattern 3: Agent memory injection

Inject a memory summary as a long prefix at session start. Cache it. Agent reasoning loop only prefills new observations and tool results.

Anti-pattern: Per-request server with KV cache

The standard HTTP server does not maintain session state. Using PromptCacheState across HTTP requests would require maintaining a session store mapping request IDs to in-memory KV caches — a stateful server design that currently doesn't exist in mlx-vlm. Until that's built, cache resumption only works in single-process chat loops.


System Stability: Serialization and Memory Safety

The OOM footgun

MLX server is single-threaded for inference but accepts concurrent HTTP connections. It has no built-in memory safeguards. When multiple requests arrive simultaneously, the server allocates KV cache memory for each in parallel without backpressure. On a 32 GB machine with a 15.7 GB model, 2-3 concurrent requests can spike memory past available limits, causing macOS Jetsam to kill other processes.
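
A back-of-envelope sketch of why a handful of concurrent requests is enough (the per-request KV figure is the ~135 MB/1K-token rate measured above; the 6 GB baseline for OS, Docker, and gateway is an assumption):

```python
# Illustrative memory arithmetic for concurrent requests at 16K context each.
MODEL_GB = 15.7
KV_GB_PER_1K = 0.135   # measured KV growth rate
OTHER_GB = 6.0         # assumption: OS + Docker + gateway baseline

def peak_gb(concurrent: int, ctx_tokens: int) -> float:
    """Each in-flight request allocates its own KV cache with no backpressure."""
    return MODEL_GB + concurrent * KV_GB_PER_1K * ctx_tokens / 1000 + OTHER_GB

for n in (1, 2, 3, 5):
    print(n, round(peak_gb(n, 16_000), 1))  # 5 concurrent requests exceed 32 GB
```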

Confirmed empirically (2026-04-04): 5 concurrent requests to the raw MLX server caused SIGKILL on all processes and took down the OpenClaw agent gateway.

Serializing proxy (the fix)

A lightweight HTTP proxy (gemma4-proxy.py) sits between all clients and the MLX server, ensuring only one inference request reaches MLX at a time:

Clients → :8891 proxy (serializing queue + memory gate) → :8890 MLX server

Features: FIFO request queue, vm_stat memory pressure check before forwarding, 120s queue timeout with clean 503, stats/health endpoints. Zero external dependencies.
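
The core mechanism can be sketched in a few lines; free_ram_gb() below is a stand-in for the proxy's vm_stat parse, and this is not the actual gemma4-proxy.py code:

```python
import threading

MIN_FREE_GB = 4.0                    # mirrors GEMMA_MIN_FREE_RAM_GB
_inference_lock = threading.Lock()   # only one request reaches MLX at a time

def free_ram_gb() -> float:
    """Stand-in for parsing `vm_stat` output on macOS."""
    return 8.0

def handle_request(forward, timeout_s: float = 120.0):
    # Waiting callers queue on the lock; 503 on timeout or memory pressure.
    if not _inference_lock.acquire(timeout=timeout_s):
        return 503, "queue timeout"
    try:
        if free_ram_gb() < MIN_FREE_GB:
            return 503, "memory pressure"
        return 200, forward()        # exactly one in-flight inference
    finally:
        _inference_lock.release()

print(handle_request(lambda: "ok"))  # (200, 'ok')
```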

After deploying the proxy, 3 concurrent requests queued and completed with zero failures and 7.4 GB free RAM — the exact scenario that previously crashed the system.

See Local MLX Inference Patterns for full proxy documentation.

Docker coexistence

Running MLX alongside Docker Desktop is dangerous on 32 GB unified memory without a wired memory cap. The failure chain:

  1. MLX fills available unified memory (model weights ~16GB + KV caches)
  2. macOS swaps aggressively to disk
  3. Docker's VM process does heavy paged writes simultaneously
  4. System hits macOS's disk write throttle → hard reset

Diagnostic source: /Library/Logs/DiagnosticReports/ResetCounter-*.diag.

Defense in depth (current production config)

Layer Setting Purpose
Serializing proxy gemma4-proxy.py on :8891 Prevents concurrent inference requests
Wired memory cap iogpu.wired_limit_mb=16384 Caps MLX GPU memory at 16 GB
Context cap --max-kv-size 16384 Limits KV cache to 16K tokens
Memory gate GEMMA_MIN_FREE_RAM_GB=4.0 Proxy rejects if <4 GB free

# One-time setup (requires sudo)
sudo sysctl iogpu.wired_limit_mb=16384
echo 'iogpu.wired_limit_mb=16384' | sudo tee -a /etc/sysctl.conf
# Note: recent macOS releases may ignore /etc/sysctl.conf at boot;
# a LaunchDaemon that re-runs the sysctl is a more reliable way to persist it.

This leaves 16 GB for Docker, OS, agent gateway, and other processes. See gemma4-local for the full server + proxy setup.


TurboQuant KV Compression

Separately from resumption, mlx-vlm 0.4.4 includes TurboQuant (PR #858) — Walsh-Hadamard rotation + Lloyd-Max quantization of the KV cache itself:

  • turbo3 (3-bit K, V): 4.6× KV memory reduction, +1.06% perplexity
  • turbo4 (4-bit K, V): 3.8× reduction, +0.23% perplexity
  • V compression is nearly free (zero quality impact at 2-bit); all quality loss comes from K compression

With TurboQuant KV-3, a 16K context KV cache that would normally take ~2.3 GB takes ~500 MB — roughly tripling the effective context budget before hitting the RAM wall.
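
The arithmetic behind that estimate, as a sketch using the figures above:

```python
# KV memory with and without TurboQuant compression (figures from above).
KV_GB_16K = 2.3          # uncompressed 16K-context KV cache
REDUCTION_TURBO3 = 4.6   # turbo3: 3-bit K and V
REDUCTION_TURBO4 = 3.8   # turbo4: 4-bit K and V

compressed_gb = KV_GB_16K / REDUCTION_TURBO3
print(round(compressed_gb, 2))  # ~0.5 GB vs ~2.3 GB uncompressed,
                                # freeing ~1.8 GB of the pre-wall RAM budget
```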

Enable via: --kv-bits 3 --kv-quant-scheme turboquant on the server.


External References

  • @tomchapin on X (2026-04-04) — shared these findings with commentary: "Using more than 16k worth of context caused the whole system to crash"