Local MLX Inference Patterns

Overview

MLX (Apple's ML framework) enables local LLM inference on Apple Silicon Macs. The mlx-lm library provides a server mode that exposes an OpenAI-compatible API endpoint. However, the server is single-threaded for inference and has no built-in memory safeguards, requiring external protection to run safely alongside other services.

Test Hardware

All benchmarks and performance numbers on this page were measured on:

Spec            Detail
Machine         MacBook Air (Mac16,13)
Chip            Apple M4
CPU             10 cores (4P + 6E)
GPU             10-core Apple M4, Metal 3
Unified Memory  32 GB
Storage         1 TB Apple SSD (AP1024Z)
OS              macOS Sequoia 15.7.4 (24G517)
MLX Version     mlx-lm 0.31.1 / mlx_vlm 0.4.3

Production Architecture

Running an MLX server bare is dangerous in any multi-process environment. The safe production setup uses a serializing proxy between clients and the MLX server:

Clients (agent gateway, scripts, curl)
  └── :8891 gemma4-proxy.py (serializing queue)
        │  • Only 1 inference request forwarded at a time
        │  • Concurrent requests queued FIFO
        │  • vm_stat memory check before each forward
        │  • Clean 503 on timeout/low-memory instead of OOM
        └── :8890 mlx_vlm server (raw, never hit directly)

Why a proxy is required

MLX server accepts concurrent HTTP connections but processes inference single-threaded. When multiple requests arrive simultaneously, the server allocates KV cache memory for each request in parallel — without backpressure. On a 32 GB machine running a 15.7 GB model, just 2-3 concurrent requests can spike memory past available limits. macOS Jetsam (the OOM killer) then terminates the largest victim process, which is often not the MLX server but the agent gateway or other critical service sharing the machine.

Empirical confirmation (2026-04-04): 5 concurrent curl requests to the raw MLX server caused all 5 to be SIGKILL'd, left the server hung, and triggered Jetsam to kill the OpenClaw gateway process. The gateway died at 16:20 PDT with "1006 abnormal closure." After adding the proxy, 3 concurrent requests queued and completed with zero failures and 7.4 GB free RAM.

Proxy features

  • Request serialization: threading.Lock ensures only one inference request reaches MLX at a time. Other requests block in a FIFO queue.
  • Memory pressure gate: checks vm_stat free+inactive pages before forwarding. Rejects with 503 if below threshold (default 4 GB).
  • Queue timeout: requests waiting longer than 120s get a clean 503 error instead of hanging.
  • Stats endpoint: /proxy/stats returns request counts, queue depth, free RAM, and MLX health.
  • Passthrough: non-inference requests (model listing, health) go through without the lock.
  • Zero dependencies: stdlib-only Python.
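
The core mechanics above can be sketched in stdlib-only Python. This is an illustrative approximation, not the actual gemma4-proxy.py: the function names are hypothetical, and threading.Lock makes no strict FIFO fairness guarantee, whereas the real proxy queues requests FIFO.

```python
import re
import subprocess
import threading

PAGE_SIZE = 16384      # Apple Silicon vm page size in bytes (as reported by vm_stat)
QUEUE_TIMEOUT_S = 120  # GEMMA_QUEUE_TIMEOUT
MIN_FREE_RAM_GB = 4.0  # GEMMA_MIN_FREE_RAM_GB

# One lock serializes all inference requests to the backend.
_inference_lock = threading.Lock()

def parse_vm_stat(output: str) -> float:
    """Free + inactive memory in GB, parsed from `vm_stat` output."""
    pages = dict(re.findall(r"Pages (free|inactive):\s+(\d+)", output))
    total = int(pages.get("free", 0)) + int(pages.get("inactive", 0))
    return total * PAGE_SIZE / 1024**3

def free_ram_gb() -> float:
    out = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
    return parse_vm_stat(out)

def handle_inference(forward):
    """Serialize one inference call; return an (http_status, body) pair."""
    if not _inference_lock.acquire(timeout=QUEUE_TIMEOUT_S):
        return 503, "queue timeout"   # waited too long behind other requests
    try:
        if free_ram_gb() < MIN_FREE_RAM_GB:
            return 503, "low memory"  # memory pressure gate
        return 200, forward()         # proxy to the raw MLX server on :8890
    finally:
        _inference_lock.release()
```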

Configuration

Env Var                Default  Description
GEMMA_PROXY_PORT       8891     Proxy listen port
GEMMA_MLX_PORT         8890     Backend MLX server port
GEMMA_QUEUE_TIMEOUT    120      Max queue wait (seconds)
GEMMA_MIN_FREE_RAM_GB  4.0      Reject threshold (GB free)

Source: gemma4-local/gemma4-proxy.py

Key Performance Characteristics

On the M4 MacBook Air (32 GB unified memory, ~100 GB/s memory bandwidth), MLX inference is memory-bandwidth bound:

  • Prefill: ~120-140 tok/sec for 4-bit quantized 26B MoE models. Throughput is roughly constant, so total prefill time scales linearly with context length.
  • Generation: ~20-42 tok/sec depending on model and quantization. Stable regardless of context size.
  • Memory: 26B MoE 4-bit uses ~15.7 GB peak RAM (of 32 GB available).
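
The ~15.7 GB figure is consistent with back-of-envelope arithmetic (a sketch, not a measurement; real Q4 formats store scales and zero points, so effective bits per weight are slightly above 4):

```python
params = 26e9           # 26B parameters
bits_per_weight = 4     # Q4 quantization
weights_gb = params * bits_per_weight / 8 / 1e9  # bytes of weights -> 13.0 GB

# The remainder of the observed 15.7 GB peak is KV cache,
# activations, and runtime overhead.
overhead_gb = 15.7 - weights_gb                  # ~2.7 GB
```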

Server mode (26B, through proxy)

  • Prompt: 60–123 tok/s
  • Generation: 33–35 tok/s steady, 60 tok/s peak
  • Peak RAM: 15.7 GB

Server Mode

mlx_vlm provides an OpenAI-compatible HTTP server:

mlx_vlm.server --model <model-id> --port 8890 --kv-bits 3 --kv-quant-scheme turboquant --max-kv-size 16384

Key server flags:

  • --kv-bits / --kv-quant-scheme: KV cache quantization (TurboQuant reduces memory for long contexts)
  • --max-kv-size: Maximum KV cache entries (cap to prevent OOM; 16384 = 16K context)
  • --prefill-step-size: Controls prefill chunking

The server exposes /v1/chat/completions and /v1/models. Never expose port 8890 directly to clients — always route through the proxy.
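
A minimal stdlib client sketch against the proxy (the model name, max_tokens, and timeout are placeholder choices; the request shape is the standard OpenAI-compatible one):

```python
import json
import urllib.request

def build_payload(prompt, model="gemma4-mlx", max_tokens=256):
    """Standard OpenAI-style chat completion body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base="http://127.0.0.1:8891"):
    """POST through the proxy on :8891 -- never the raw MLX port 8890."""
    req = urllib.request.Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    # Generous timeout: the request may first wait in the proxy's FIFO queue.
    with urllib.request.urlopen(req, timeout=180) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```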

Memory Management

Wired memory cap (system-level)

MLX claims all available GPU memory by default. Set a system-level cap to leave room for other processes:

sudo sysctl iogpu.wired_limit_mb=16384  # 16 GB for MLX, 16 GB for everything else

A previous 20 GB cap proved too thin: it left only 12 GB for the OS + Docker + agent gateway, which was insufficient under load.

Context cap (application-level)

--max-kv-size 16384 caps context at 16K tokens. Required because prefix cache is broken on hybrid attention models (see Prefix Cache Behavior), making TTFT scale linearly. At 16K, TTFT is ~131s — the practical ceiling for a usable interaction.
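
The ~131 s figure follows directly from the prefill rate: with the prefix cache broken, the full context is re-prefilled on every request, so TTFT is simply context length divided by prefill throughput.

```python
context_tokens = 16384     # --max-kv-size cap (16K context)
prefill_tok_per_s = 125    # mid-range of the ~120-140 tok/s measured on this machine
ttft_s = context_tokens / prefill_tok_per_s
print(round(ttft_s))       # ≈ 131 s, matching the observed ceiling
```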

Defense in depth

Three layers prevent memory-related crashes:

Layer              Mechanism                                What it prevents
Serializing proxy  Only 1 request at a time + memory check  Concurrent allocation spikes
Wired memory cap   iogpu.wired_limit_mb=16384               MLX claiming all GPU memory
Context cap        --max-kv-size 16384                      Unbounded KV cache growth

Quantization Trade-offs

  • 4-bit (Q4): Best throughput/quality balance for ≤32B models. ~4x memory reduction.
  • TurboQuant KV-3: Reduces KV cache memory with minimal quality loss. Enables longer contexts within memory budget.
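
To see why KV quantization matters, here is the generic KV cache size formula, evaluated with purely hypothetical model dimensions (this page does not state the actual Gemma 4 dimensions):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """2x for keys and values; bits/8 bytes per stored element."""
    return 2 * layers * kv_heads * head_dim * tokens * bits / 8

# Hypothetical dimensions for illustration only:
layers, kv_heads, head_dim, tokens = 48, 8, 128, 16384

fp16_gb = kv_cache_bytes(layers, kv_heads, head_dim, tokens, 16) / 1e9
kv3_gb  = kv_cache_bytes(layers, kv_heads, head_dim, tokens, 3) / 1e9
# KV-3 cuts the cache to 3/16 of its fp16 size (~5.3x smaller)
```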

Practical Limits for Agent Use

A local model is viable as an agent primary model only if:

  1. Prefix caching works (requires pure full-attention architecture — see Prefix Cache Behavior)
  2. TTFT stays under ~5 seconds for typical context sizes
  3. Context window is ≥32k tokens

As of 2026-04, no major hybrid-attention model meets all three criteria on local MLX. The 16K-capped Gemma 4 is usable as a fallback model with the proxy safety layer.

Integration with Agent Frameworks

Both OpenClaw and Hermes support custom/local providers via base_url configuration. Always point the config at the proxy port (8891), not the raw MLX port (8890).

{
  "gemma4-mlx": {
    "baseUrl": "http://127.0.0.1:8891/v1"
  }
}

Caution: agent frameworks that compress large session histories through the model (e.g., on startup) can choke local models. Ensure compression/summarization uses a cloud model, not the local endpoint.