TL;DR

  • MLX does not cliff through 40K tokens on Mac Mini M4 Pro.
  • MLX prefill at 15K: 1.650 ms/tok. Ollama FA=0 at 15K: 1.774 ms/tok (fp16) and 1.694 ms/tok (q8_0). Difference: 7% and 3% respectively.
  • Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not attention kernel.
  • The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model.

Saw someone running gemma4:26b-mlx directly: not through Ollama, but natively on the MLX runtime. Left a reply: “We hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend?”

Before they answered, I ran the test myself.


Why This Question Matters

Exp 008 and Exp 010 proved that OLLAMA_FLASH_ATTENTION=1 causes the prefill cliff on Mac Mini M4 Pro. Remove the flag — no cliff through 40K tokens. Keep it — cliff at 32.5K, prefill triple the cost at 15K. The 2×2 factorial was unambiguous.

But “Ollama FA=1 cliffs” and “unified memory architectures cliff under long-context attention” are different claims. Exp 008 and Exp 010 established the flag as the trigger. They could not rule out that Ollama’s attention kernel, even with FA=0, had implementation-specific behaviour that happened to match the hardware ceiling.

MLX is Apple’s own ML framework for Apple Silicon. No llama.cpp under the hood. Different attention implementation, different memory access patterns, different compilation target. If MLX also cliffs at the same context size — or at all — the hardware architecture is the binding constraint. If MLX does not cliff, the cliff was a runtime implementation detail.


Pre-registered Confounds

Three confounds documented before running a single token:

1. Quantisation format. Ollama runs GGUF Q4_K_M. The MLX model (mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit) uses OptiQ 4-bit. Different compression schemes. This is not a pure runtime isolation — quantisation affects memory footprint and compute patterns.

2. Active parameter count. The MLX model name includes “A4B” — likely 4B active parameters per forward pass in the MoE sparse activation. Ollama reports 25.8B parameters for gemma4:26b. If the MLX model activates fewer experts per token, generation is cheaper per step. This would make gen t/s non-comparable between runtimes. Prefill scaling (how ms/tok grows with context) remains informative regardless.

3. Tokeniser. MLX uses the HuggingFace tokeniser; Ollama embeds the tokeniser in the GGUF. Token counts for the same prompt text can therefore differ. The bench script reports the MLX tokeniser’s actual count, which ran at ~89% of the target size consistently across all cells.
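The ~89% figure can be checked against the token counts reported in the Phase A sweep. A minimal sketch, assuming the size labels are multiples of 1,000 tokens (so "4K" means 4,000), which the write-up does not state explicitly:

```python
# Target sizes (assumption: "4K" means 4,000 tokens) and the actual token
# counts reported by the MLX tokeniser in the Phase A sweep.
targets = {"4K": 4_000, "8K": 8_000, "15K": 15_000, "25K": 25_000, "35K": 35_000}
actual = {"4K": 3_555, "8K": 7_112, "15K": 13_333, "25K": 22_223, "35K": 31_112}

# Fraction of the target size that the tokeniser actually produced.
ratios = {size: actual[size] / targets[size] for size in targets}

for size, ratio in ratios.items():
    print(f"{size:>4}: {actual[size]:>6} / {targets[size]:>6} = {ratio:.3f}")
```

Under that assumption every cell lands at roughly the same ratio, which is why the prefill numbers are reported per token rather than per target size.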


Phase A — Generation Sweep

5 context sizes, 3 reps each, 60s idle between sizes. Ollama daemon stopped before the run. Rep 1 prefill is authoritative — cold KV, no cache reuse.

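| Size | Tokens | Rep1 prefill ms/tok | Mean gen t/s | Ollama FA=0/q8_0 gen ref |
|------|--------|---------------------|--------------|--------------------------|
| 4K   | 3,555  | 1.546               | 52.41        | 42.05                    |
| 8K   | 7,112  | 1.628               | 47.24        | 44.07                    |
| 15K  | 13,333 | 1.650               | 38.71        | 38.41                    |
| 25K  | 22,223 | 1.964               | 27.00        | 32.35                    |
| 35K  | 31,112 | 2.161               | 26.42        | 27.90                    |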

No cliff. Prefill grows steadily from 1.546 ms/tok at 4K to 2.161 ms/tok at 35K — roughly linear, no super-quadratic onset.
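To make "no cliff" concrete, the sketch below computes how much the per-token prefill cost grows between adjacent sizes, using only the Phase A figures. The total prefill time is derived here as tokens × ms/tok; it is not a separately measured value.

```python
# (actual tokens, Rep 1 prefill ms/tok) from the Phase A sweep.
phase_a = [
    (3_555, 1.546),
    (7_112, 1.628),
    (13_333, 1.650),
    (22_223, 1.964),
    (31_112, 2.161),
]

# Growth in per-token prefill cost between adjacent sizes.
step_ratios = [b[1] / a[1] for a, b in zip(phase_a, phase_a[1:])]

# Overall growth from the smallest to the largest context.
overall = phase_a[-1][1] / phase_a[0][1]

# Implied total prefill time in seconds (derived: tokens * ms/tok / 1000).
total_s = [tokens * ms / 1000 for tokens, ms in phase_a]

print([round(r, 3) for r in step_ratios])
print(round(overall, 3))
print([round(t, 1) for t in total_s])
```

The largest single step is about 1.19× (15K to 25K), and the per-token cost rises about 1.4× while the token count rises almost 9×. A cliff would show up as one step ratio far above the rest.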

The gen t/s pattern is different from Ollama. At 4K, MLX runs at 52.41 t/s, about 25% faster than Ollama’s 42.05. By 15K the two are level (38.71 vs 38.41), at 25K MLX trails (27.00 vs 32.35), and by 35K the gap is back to about 5% (26.42 vs 27.90). This is consistent with the A4B confound: with fewer active parameters per forward pass, autoregressive decoding is cheaper per step at short context. As KV cache overhead grows with context length, the per-step compute advantage shrinks and both runtimes converge near the same memory bandwidth ceiling.
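The per-size gap can be computed directly from the two generation columns. This is a small sketch over the reported means; the sign convention (positive means MLX is faster) is chosen here for illustration.

```python
# size: (MLX mean gen t/s, Ollama FA=0/q8_0 gen t/s reference)
gen = {
    "4K": (52.41, 42.05),
    "8K": (47.24, 44.07),
    "15K": (38.71, 38.41),
    "25K": (27.00, 32.35),
    "35K": (26.42, 27.90),
}

# Positive means MLX generates faster than the Ollama reference.
gap_pct = {size: (mlx / ollama - 1) * 100 for size, (mlx, ollama) in gen.items()}

for size, gap in gap_pct.items():
    print(f"{size:>4}: {gap:+.1f}%")
```

The gap shrinks from the 4K lead, crosses zero around 15K, and stays within single digits at 35K. The 25K cell is the widest deficit for MLX.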


Phase B — Cliff Scan

9 sizes from 20K to 40K in 2.5K steps, 2 reps per size. Cliff threshold: 2× the MLX 15K baseline = 3.300 ms/tok.

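| Size  | Tokens | Rep1 prefill ms/tok | Cliff |
|-------|--------|---------------------|-------|
| 20K   | 17,778 | 1.859               | no    |
| 22.5K | 20,001 | 1.961               | no    |
| 25K   | 22,223 | 2.128               | no    |
| 27.5K | 24,445 | 2.118               | no    |
| 30K   | 26,666 | 2.238               | no    |
| 32.5K | 28,889 | 2.498               | no    |
| 35K   | 31,112 | 2.464               | no    |
| 37.5K | 33,333 | 2.518               | no    |
| 40K   | 35,555 | 2.270               | no    |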

Peak: 2.518 ms/tok at 37.5K. Threshold: 3.300 ms/tok. Margin: 24%.
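The cliff test reduces to a threshold comparison. The sketch below reproduces the peak and margin from the Phase B figures; the function name `find_cliffs` and the choice of `>=` as the comparison are assumptions for illustration, not taken from the bench script.

```python
BASELINE_15K = 1.650          # MLX Rep 1 prefill at 15K, ms/tok
THRESHOLD = 2 * BASELINE_15K  # cliff threshold used in Phase B

# Rep 1 prefill ms/tok per target size from the Phase B scan.
phase_b = {
    "20K": 1.859, "22.5K": 1.961, "25K": 2.128, "27.5K": 2.118, "30K": 2.238,
    "32.5K": 2.498, "35K": 2.464, "37.5K": 2.518, "40K": 2.270,
}


def find_cliffs(results, threshold):
    """Return the sizes whose prefill cost meets or exceeds the threshold."""
    return [size for size, ms in results.items() if ms >= threshold]


cliffs = find_cliffs(phase_b, THRESHOLD)
peak_size, peak_ms = max(phase_b.items(), key=lambda kv: kv[1])
margin_pct = (THRESHOLD - peak_ms) / THRESHOLD * 100

print(cliffs, peak_size, peak_ms, round(margin_pct, 1))
```

The margin is expressed as a fraction of the threshold, which is the convention that reproduces the 24% figure.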

No cliff at 32.5K — the exact size where OLLAMA_FLASH_ATTENTION=1 alone triggered a cliff in Exp 010 Condition B. No cliff at 20K — where FA=1 combined with q8_0 pushed Exp 007 past its limit. Smooth growth through 40K.


Hypothesis Verdicts

H1: CONFIRMED. MLX 15K prefill (1.650 ms/tok) is below the Ollama FA=0/fp16 baseline (1.774 ms/tok). MLX’s unified-memory-native kernels match or marginally outperform the best Ollama configuration.

H2: CONFIRMED. No prefill cliff through 40K tokens. The FA=1 cliff is absent from MLX entirely.

H3: CONFIRMED. MLX 25K gen t/s (27.00) falls within the pre-registered ±25% window around Ollama FA=0/fp16 (31.08). The wide tolerance was set because the A4B confound was expected to make gen t/s non-comparable. That is what happened at short context, with the two runtimes converging by 35K.
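The H1 and H3 checks are simple enough to state as code. A minimal sketch, assuming the ±25% window is symmetric and measured relative to the Ollama reference; `within_window` is a hypothetical helper name.

```python
def within_window(value, reference, tolerance=0.25):
    """True if value lies within +/- tolerance (as a fraction) of reference."""
    return abs(value - reference) <= tolerance * reference


# H1: MLX 15K prefill is no worse than the Ollama FA=0/fp16 baseline.
mlx_prefill_15k = 1.650
ollama_prefill_15k = 1.774
h1 = mlx_prefill_15k <= ollama_prefill_15k

# H3: MLX 25K gen t/s sits inside the pre-registered window.
mlx_gen_25k = 27.00
ollama_gen_25k = 31.08
h3 = within_window(mlx_gen_25k, ollama_gen_25k)
deviation_pct = (mlx_gen_25k / ollama_gen_25k - 1) * 100

print(h1, h3, round(deviation_pct, 1))
```

The 25K deviation comes out at roughly -13%, so H3 passes with about half the window to spare.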


What Two Runtimes Now Confirm

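| Runtime | Config     | Cliff?      | 15K prefill ms/tok |
|---------|------------|-------------|--------------------|
| Ollama  | FA=1, q8_0 | yes (20K)   | 25.08              |
| Ollama  | FA=1, fp16 | yes (32.5K) | 5.405              |
| Ollama  | FA=0, fp16 | no          | 1.774              |
| Ollama  | FA=0, q8_0 | no          | 1.694              |
| MLX     | default    | no          | 1.650              |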

The cliff appears in every configuration that touches Flash Attention in llama.cpp. It is absent from every configuration that does not — including an entirely independent runtime.
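The same claim can be written as a check over the five rows. In this sketch the `llamacpp_fa` field records only whether OLLAMA_FLASH_ATTENTION=1 was set; it says nothing about how MLX implements attention internally. The `Run` class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    runtime: str
    llamacpp_fa: bool    # True only where OLLAMA_FLASH_ATTENTION=1 was set
    variant: str         # second config label from the table, copied as-is
    cliff: bool
    prefill_15k: float   # ms/tok at the 15K target size


runs = [
    Run("Ollama", True, "q8_0", True, 25.08),
    Run("Ollama", True, "fp16", True, 5.405),
    Run("Ollama", False, "fp16", False, 1.774),
    Run("Ollama", False, "q8_0", False, 1.694),
    Run("MLX", False, "default", False, 1.650),
]

# Does the cliff track the flag in every row?
cliff_follows_flag = all(r.cliff == r.llamacpp_fa for r in runs)

# 15K prefill cost of each run relative to the cheapest run in the table.
best = min(r.prefill_15k for r in runs)
slowdown = [(r, r.prefill_15k / best) for r in runs]

print(cliff_follows_flag)
for r, x in slowdown:
    print(f"{r.runtime:<7}{r.variant:<8}fa_flag={r.llamacpp_fa!s:<6}{x:5.2f}x")
```

The three non-cliff rows sit within about 8% of each other at 15K, while the two FA=1 rows are roughly 3× and 15× the cheapest run.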

This is the cleanest possible causal isolation short of inspecting the kernel code: vary the runtime, hold the hardware constant, observe whether the cliff follows the flag or the chip. It follows the flag.


The Implication

The Mac Mini M4 Pro does not have an architectural attention scaling problem. It has a 273 GB/s memory bus that Flash Attention’s tiling overhead fights against on unified memory — because the tile pass was designed for discrete GPU SRAM/HBM hierarchies that Apple Silicon does not have.

Remove the flag, or use a runtime that doesn’t implement FA the same way, and the hardware behaves as the memory bandwidth numbers predict: roughly linear prefill growth, no cliff through 40K tokens.

The ceiling for long-context inference on this machine is memory bandwidth. That is a fixed physical constraint. Everything else — runtime choice, flag configuration, quantisation format — is an engineering decision. And engineering decisions can be changed.


Evidence: exp_011_mlx_runtime/ · scientific_log.md
