Local MLX Inference Patterns¶
Overview¶
MLX (Apple's ML framework) enables local LLM inference on Apple Silicon Macs. The mlx-lm library provides a server mode that exposes an OpenAI-compatible API endpoint. However, the server is single-threaded for inference and has no built-in memory safeguards, requiring external protection to run safely alongside other services.
Test Hardware¶
All benchmarks and performance numbers on this page were measured on:
| Spec | Detail |
|---|---|
| Machine | MacBook Air (Mac16,13) |
| Chip | Apple M4 |
| CPU | 10 cores (4P + 6E) |
| GPU | 10-core Apple M4, Metal 3 |
| Unified Memory | 32 GB |
| Storage | 1 TB Apple SSD (AP1024Z) |
| OS | macOS Sequoia 15.7.4 (24G517) |
| MLX Version | mlx-lm 0.31.1 / mlx_vlm 0.4.3 |
Production Architecture¶
Running an MLX server bare is dangerous in any multi-process environment. The safe production setup uses a serializing proxy between clients and the MLX server:
```
Clients (agent gateway, scripts, curl)
  └── :8891  gemma4-proxy.py (serializing queue)
        │  • Only 1 inference request forwarded at a time
        │  • Concurrent requests queued FIFO
        │  • vm_stat memory check before each forward
        │  • Clean 503 on timeout/low-memory instead of OOM
        └── :8890  mlx_vlm server (raw, never hit directly)
```
Why a proxy is required¶
MLX server accepts concurrent HTTP connections but processes inference single-threaded. When multiple requests arrive simultaneously, the server allocates KV cache memory for each request in parallel — without backpressure. On a 32 GB machine running a 15.7 GB model, just 2-3 concurrent requests can spike memory past available limits. macOS Jetsam (the OOM killer) then terminates the largest victim process, which is often not the MLX server but the agent gateway or other critical service sharing the machine.
Empirical confirmation (2026-04-04): 5 concurrent curl requests to the raw MLX server caused all 5 to be SIGKILL'd, left the server hung, and triggered Jetsam to kill the OpenClaw gateway process. The gateway died at 16:20 PDT with "1006 abnormal closure." After adding the proxy, 3 concurrent requests queued and completed with zero failures and 7.4 GB free RAM.
Proxy features¶
- Request serialization: threading.Lock ensures only one inference request reaches MLX at a time. Other requests block in a FIFO queue.
- Memory pressure gate: checks vm_stat free + inactive pages before forwarding. Rejects with 503 if below the threshold (default 4 GB).
- Queue timeout: requests waiting longer than 120s get a clean 503 error instead of hanging.
- Stats endpoint: /proxy/stats returns request counts, queue depth, free RAM, and MLX health.
- Passthrough: non-inference requests (model listing, health) go through without the lock.
- Zero dependencies: stdlib-only Python.
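The behaviors above fit in a compact stdlib-only script. The sketch below is illustrative rather than the actual gemma4-proxy.py: the ports and environment variables follow the configuration table below, but the vm_stat parsing, response shapes, and error payloads are assumptions.

```python
# Minimal sketch of a serializing MLX proxy. NOT the real gemma4-proxy.py:
# env vars and ports match the configuration table; everything else is assumed.
import json
import os
import re
import subprocess
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PROXY_PORT = int(os.environ.get("GEMMA_PROXY_PORT", "8891"))
MLX_PORT = int(os.environ.get("GEMMA_MLX_PORT", "8890"))
QUEUE_TIMEOUT = float(os.environ.get("GEMMA_QUEUE_TIMEOUT", "120"))
MIN_FREE_GB = float(os.environ.get("GEMMA_MIN_FREE_RAM_GB", "4.0"))

inference_lock = threading.Lock()  # serializes inference requests


def free_ram_gb() -> float:
    """Free + inactive memory reported by vm_stat, in GB."""
    out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
    page_size = int(re.search(r"page size of (\d+)", out).group(1))
    pages = sum(int(n) for n in re.findall(r"Pages (?:free|inactive):\s+(\d+)", out))
    return pages * page_size / 1024 ** 3


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/proxy/stats":  # answered locally, no lock needed
            return self._json(200, {"free_ram_gb": round(free_ram_gb(), 2),
                                    "busy": inference_lock.locked()})
        self._forward(None)  # passthrough (e.g. /v1/models)

    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if not self.path.startswith("/v1/chat/completions"):
            return self._forward(body)  # non-inference passthrough
        if not inference_lock.acquire(timeout=QUEUE_TIMEOUT):
            return self._json(503, {"error": "queue timeout"})
        try:
            if free_ram_gb() < MIN_FREE_GB:
                return self._json(503, {"error": "low memory"})
            self._forward(body)
        finally:
            inference_lock.release()

    def _forward(self, body):
        # Error handling (backend down, non-2xx responses) elided for brevity.
        req = urllib.request.Request(
            f"http://127.0.0.1:{MLX_PORT}{self.path}", data=body,
            headers={"Content-Type": "application/json"}, method=self.command)
        with urllib.request.urlopen(req) as resp:
            self._raw(resp.status, resp.read())

    def _json(self, code, obj):
        self._raw(code, json.dumps(obj).encode())

    def _raw(self, code, data):
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    # ThreadingHTTPServer lets queued requests block on the lock while one runs.
    ThreadingHTTPServer(("127.0.0.1", PROXY_PORT), ProxyHandler).serve_forever()
```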
Configuration¶
| Env Var | Default | Description |
|---|---|---|
| GEMMA_PROXY_PORT | 8891 | Proxy listen port |
| GEMMA_MLX_PORT | 8890 | Backend MLX server port |
| GEMMA_QUEUE_TIMEOUT | 120 | Max queue wait (seconds) |
| GEMMA_MIN_FREE_RAM_GB | 4.0 | Reject threshold (GB free) |
Source: gemma4-local/gemma4-proxy.py
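For a quick health check, the stats endpoint can be polled once the proxy is up. A minimal sketch; the exact field names in the JSON response depend on the proxy implementation:

```python
import json
import urllib.request

# Poll the proxy's stats endpoint (request counts, queue depth, free RAM, MLX health).
# Field names are not specified here, so print the raw payload to inspect them.
with urllib.request.urlopen("http://127.0.0.1:8891/proxy/stats", timeout=5) as resp:
    print(json.dumps(json.load(resp), indent=2))
```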
Key Performance Characteristics¶
On the M4 MacBook Air (32 GB unified memory, ~100 GB/s memory bandwidth), MLX inference is memory-bandwidth bound:
- Prefill: ~120-140 tok/sec for 4-bit quantized 26B MoE models; total prefill time scales linearly with context length.
- Generation: ~20-42 tok/sec depending on model and quantization. Stable regardless of context size.
- Memory: 26B MoE 4-bit uses ~15.7 GB peak RAM (of 32 GB available).
Server mode (26B, through proxy)¶
- Prompt: 60–123 tok/s
- Generation: 33–35 tok/s steady, 60 tok/s peak
- Peak RAM: 15.7 GB
Server Mode¶
mlx_vlm provides an OpenAI-compatible HTTP server:
```
mlx_vlm.server --model <model-id> --port 8890 --kv-bits 3 --kv-quant-scheme turboquant --max-kv-size 16384
```
Key server flags:
- --kv-bits / --kv-quant-scheme: KV cache quantization (TurboQuant reduces memory for long contexts)
- --max-kv-size: Maximum KV cache entries (cap to prevent OOM; 16384 = 16K context)
- --prefill-step-size: Controls prefill chunking
The server exposes /v1/chat/completions and /v1/models. Never expose port 8890 directly to clients — always route through the proxy.
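A minimal client call through the proxy, sketched with stdlib urllib. The model name is a placeholder (use whatever /v1/models reports); the request body is the standard OpenAI chat-completions shape:

```python
import json
import urllib.request

# Chat-completions request sent to the proxy on :8891, never to the raw :8890 port.
payload = {
    "model": "gemma4-mlx",  # placeholder; check /v1/models for the actual id
    "messages": [{"role": "user", "content": "Summarize MLX in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8891/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Generous timeout: the request may sit in the proxy queue behind another job.
with urllib.request.urlopen(req, timeout=300) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])
```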
Memory Management¶
Wired memory cap (system-level)¶
MLX claims all available GPU memory by default. Set a system-level cap to leave room for other processes:
```
sudo sysctl iogpu.wired_limit_mb=16384   # 16 GB for MLX, 16 GB for everything else
```
A previous 20 GB cap left only 12 GB for the OS, Docker, and the agent gateway, which proved insufficient under load.
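The setting does not persist across reboots, so it is worth reading the value back to confirm it is in effect. A small sketch using the same sysctl key:

```python
import subprocess

# Read back the current wired-memory cap (value is in MB); 0 usually means "no cap set".
limit_mb = int(subprocess.run(
    ["sysctl", "-n", "iogpu.wired_limit_mb"],
    capture_output=True, text=True, check=True,
).stdout.strip())
print(f"iogpu.wired_limit_mb = {limit_mb} MB")
```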
Context cap (application-level)¶
--max-kv-size 16384 caps context at 16K tokens. Required because prefix cache is broken on hybrid attention models (see Prefix Cache Behavior), making TTFT scale linearly with context length. At 16K, TTFT is ~131s (roughly 16,384 tokens at ~125 tok/s prefill) — the practical ceiling for a usable interaction.
Defense in depth¶
Three layers prevent memory-related crashes:
| Layer | Mechanism | What it prevents |
|---|---|---|
| Serializing proxy | Only 1 request at a time + memory check | Concurrent allocation spikes |
| Wired memory cap | iogpu.wired_limit_mb=16384 | MLX claiming all GPU memory |
| Context cap | --max-kv-size 16384 | Unbounded KV cache growth |
Quantization Trade-offs¶
- 4-bit (Q4): Best throughput/quality balance for ≤32B models. ~4x memory reduction.
- TurboQuant KV-3: Reduces KV cache memory with minimal quality loss. Enables longer contexts within memory budget.
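As a rough sanity check on the 15.7 GB peak figure, a back-of-envelope estimate of the 4-bit weight footprint, assuming pure 4-bit storage (the remainder is attributed to quantization scales, higher-precision layers, the KV cache, and runtime buffers, which is an assumption):

```python
# Back-of-envelope weight footprint for a 26B model at 4 bits per parameter.
# Assumes pure 4-bit storage; the gap to the observed 15.7 GB peak is taken up
# by quantization scales, higher-precision layers, KV cache, and runtime buffers.
params = 26e9
weights_gib = params * (4 / 8) / 1024**3   # 4 bits = 0.5 bytes per parameter
print(f"~{weights_gib:.1f} GiB of quantized weights")   # ~12.1 GiB
```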
Practical Limits for Agent Use¶
A local model is viable as an agent primary model only if:

1. Prefix caching works (requires pure full-attention architecture — see Prefix Cache Behavior)
2. TTFT stays under ~5 seconds for typical context sizes
3. Context window is ≥32k tokens
As of 2026-04, no major hybrid-attention model meets all three criteria on local MLX. The 16K-capped Gemma 4 is usable as a fallback model with the proxy safety layer.
Integration with Agent Frameworks¶
Both OpenClaw and Hermes support custom/local providers via base_url configuration. Always point the config at the proxy port (8891), not the raw MLX port (8890).
```json
{
  "gemma4-mlx": {
    "baseUrl": "http://127.0.0.1:8891/v1"
  }
}
```
Caution: agent frameworks that compress large session histories through the model (e.g., on startup) can choke local models. Ensure compression/summarization uses a cloud model, not the local endpoint.