Prefix Cache Behavior in Hybrid Attention Models

Core Finding

Prefix cache reuse is broken for all hybrid-architecture models in mlx-lm. Only pure full-attention models get cache hits.

Mechanism

Hybrid sliding-window attention architectures (e.g., Gemma 3/4's 5:1 sliding+global layer pattern with sliding_window=1024) use RotatingKVCache for the sliding layers. Once the context exceeds the window size, the cache evicts the oldest tokens to stay within the window, which destroys the shared-prefix alignment between turns.

In a multi-turn conversation, each new turn recomputes the full context from scratch because the prefix is no longer in cache. This makes prefill time scale linearly with total context, with no amortization across turns.
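
The failure mode can be illustrated with a toy ring-buffer cache. This is a deliberate simplification of RotatingKVCache (names and the positional prefix check are illustrative, not mlx-lm's API), but it shows why eviction zeroes out prefix reuse:

```python
from collections import deque

WINDOW = 4  # stand-in for sliding_window=1024

def run_turn(cache: deque, tokens: list[str]) -> int:
    """Append a turn's tokens to a window-limited cache; return how many
    leading tokens were already cached (the prefix hit count)."""
    hit = 0
    for cached, new in zip(cache, tokens):
        if cached != new:
            break
        hit += 1
    for t in tokens[hit:]:
        cache.append(t)  # deque(maxlen=...) silently evicts the oldest token
    return hit

cache = deque(maxlen=WINDOW)
turn1 = ["sys1", "sys2", "sys3", "u1", "a1"]  # 5 tokens > window of 4
run_turn(cache, turn1)
# "sys1" has been evicted, so the shared system prompt no longer lines up
# with the cache and turn 2 gets zero prefix reuse:
print(run_turn(cache, turn1 + ["u2"]))  # 0 -- full recompute
```

With an unbounded cache (maxlen larger than the conversation), the second call would report a 5-token prefix hit instead of 0; eviction alone is what breaks the alignment.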

Affected Models

  • Gemma 3 / Gemma 4 (all sizes) — 5:1 sliding+global attention
  • Qwen 3.5 — hybrid attention
  • Llama 4 — hybrid attention
  • GPT-OSS — hybrid attention

Unaffected Models

  • MiniMax M2.5 — pure full attention, confirmed working cache reuse
  • Any model using only standard full-attention layers (GPT-3/4 style)

Empirical Evidence

Benchmarked on Gemma 4 26B MoE (gemma-4-26b-a4b-it-4bit) via mlx-lm v0.31.1 server on a MacBook Air M4 (32 GB unified memory, 10-core GPU, macOS 15.7.4):

  • Small scale (<1k tokens): Appeared to show cache hits (Turn 2 was 4.9x faster than Turn 1), likely because the entire context fit within the 1024-token sliding window, so nothing had been evicted.
  • Larger scale (~4k-token system prompt): 0% cache hit rate; every turn did a full recomputation. Turns 2-5 took ~80-90s each vs. a ~92s cold start, i.e. essentially no improvement.
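
A minimal harness for reproducing these measurements, assuming mlx-lm's OpenAI-compatible /v1/chat/completions endpoint (host, port, and model name are placeholders for your setup; TTFT is approximated by timing a max_tokens=1 request rather than streaming):

```python
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # adjust to your mlx_lm.server

def ttft(messages, model="gemma-4-26b-a4b-it-4bit"):
    """Approximate TTFT: request a single token and time the whole call."""
    body = json.dumps({"model": model, "messages": messages,
                       "max_tokens": 1}).encode()
    req = urllib.request.Request(URL, body, {"Content-Type": "application/json"})
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

def cache_speedup(cold: float, warm: float) -> float:
    """Ratio > 1 means the warm turn benefited from prefix reuse."""
    return cold / warm

if __name__ == "__main__":
    history = [{"role": "system", "content": "<~4k-token system prompt>"}]
    t1 = ttft(history + [{"role": "user", "content": "turn 1"}])
    t2 = ttft(history + [{"role": "user", "content": "turn 1"},
                         {"role": "assistant", "content": "<reply>"},
                         {"role": "user", "content": "turn 2"}])
    print(f"turn1 {t1:.1f}s, turn2 {t2:.1f}s, "
          f"speedup {cache_speedup(t1, t2):.1f}x")
```

A speedup near 1.0x on turn 2, as observed above, indicates zero prefix reuse.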

Prefill Performance (Single Turn)

  Context Size   TTFT     Prefill Rate
  1k tokens      ~9s      ~120 tok/sec
  4k tokens      ~29s     ~140 tok/sec
  8k tokens      ~58s     ~140 tok/sec
  16k tokens     ~131s    ~120 tok/sec
  24k tokens     ~198s    ~120 tok/sec

Generation speed once prefill completes: ~20-27 tok/sec (fast, not the bottleneck).
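
The table is consistent with a roughly constant prefill rate, which is exactly what makes TTFT linear in context size. A quick arithmetic check on the figures above:

```python
# TTFT in seconds per context size, taken from the table above
ttft = {1_000: 9, 4_000: 29, 8_000: 58, 16_000: 131, 24_000: 198}

for tokens, seconds in ttft.items():
    rate = tokens / seconds
    print(f"{tokens:>6} tokens: {rate:.0f} tok/sec prefill")
# Rates cluster around 110-140 tok/sec at every size, so doubling the
# context roughly doubles TTFT: with no prefix reuse, every turn pays
# the full linear cost again.
```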

Practical Implications

  • Local hybrid models are unusable as an agent's primary model at real context depths (>4k tokens): every turn pays tens of seconds of TTFT.
  • Context compression or aggressive summarization doesn't help enough; even an 8k context means ~1 minute of TTFT.
  • The only viable path for local inference with prefix caching is pure full-attention models.
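
Whether a given checkpoint will hit this problem can usually be read off its config.json before downloading weights. A heuristic sketch (the sliding_window and layer_types fields follow common Hugging Face conventions, e.g. Gemma-style configs, but field names vary by architecture, so treat this as an assumption-laden check, not a guarantee):

```python
import json

def looks_hybrid(config_path: str) -> bool:
    """Heuristic: flag configs that declare sliding-window attention layers."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Multimodal checkpoints often nest the LM config under "text_config".
    cfg = cfg.get("text_config", cfg)
    if cfg.get("sliding_window"):
        return True
    return "sliding_attention" in (cfg.get("layer_types") or [])
```

If this returns True, expect RotatingKVCache behavior and no cross-turn prefix reuse once the context exceeds the window.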

Operational mitigation

Gemma 4 is capped at 16K context (--max-kv-size 16384) and served behind a serializing proxy that prevents concurrent requests from causing OOM crashes. This makes it usable as a fallback model with acceptable (if slow) TTFT, while preventing the broken prefix cache from causing cascading system failures.
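
The serializing proxy needs nothing more than a global lock around request forwarding. A minimal stdlib sketch (the upstream URL and proxy port are assumptions for illustration, and streaming responses are not handled):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

UPSTREAM = "http://localhost:8080"  # assumed mlx_lm.server address
LOCK = threading.Lock()

def serialized(fn):
    """Decorator: run fn under a global lock so concurrent requests queue
    up instead of hitting the model server simultaneously (and OOMing it)."""
    def wrapper(*args, **kwargs):
        with LOCK:
            return fn(*args, **kwargs)
    return wrapper

@serialized
def forward(path: str, body: bytes, content_type: str):
    """Forward one request upstream; return (status, payload)."""
    req = urllib.request.Request(UPSTREAM + path, body,
                                 {"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()

class SerializingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        status, payload = forward(
            self.path, body,
            self.headers.get("Content-Type", "application/json"))
        self.send_response(status)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # Clients talk to :8081; at most one request reaches the model at a time.
    ThreadingHTTPServer(("localhost", 8081), SerializingProxy).serve_forever()
```

Callers block in the proxy's queue rather than failing, which trades latency under contention for the stability described above.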

Upstream Status

  • Tracked in mlx-lm issue #980.
  • PR #911 ("Better caching in the server") and PR #949 (rotating cache for sliding attention) were merged in mlx-lm v0.31.0 but do not fix the fundamental problem — hybrid architectures cannot reuse prefixes by design.