Same Hardware. Different Runtime. Same Result.

TL;DR

MLX does not cliff through the 40K-token target (35,555 actual tokens) on Mac Mini M4 Pro.
MLX prefill at 15K: 1.650 ms/tok. Ollama FA=0/fp16 at 15K: 1.774 ms/tok. Difference: 7%.
Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not attention kernel.
The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model.

Saw someone running gemma4:26b-mlx directly: not through Ollama, but natively on the MLX runtime. Left a reply: “We hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend?”

Before they answered, I ran the test myself.

Why This Question Matters

Exp 008 and Exp 010 proved that OLLAMA_FLASH_ATTENTION=1 causes the prefill cliff on Mac Mini M4 Pro. Remove the flag — no cliff through 40K tokens. Keep it — cliff at 32.5K, prefill triple the cost at 15K. The 2×2 factorial was unambiguous.

But “Ollama FA=1 cliffs” and “unified memory architectures cliff under long-context attention” are different claims. Exp 008 and Exp 010 established the flag as the cause within Ollama. They could not rule out that Ollama’s attention kernel, even with FA=0, had implementation-specific behaviour that happened to match the hardware ceiling.

MLX is Apple’s own ML framework for Apple Silicon. No llama.cpp under the hood. Different attention implementation, different memory access patterns, different compilation target. If MLX also cliffs at the same context size — or at all — the hardware architecture is the binding constraint. If MLX does not cliff, the cliff was a runtime implementation detail.

Pre-registered Confounds

Three confounds documented before running a single token:

1. Quantisation format. Ollama runs GGUF Q4_K_M. The MLX model (mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit) uses OptiQ 4-bit. Different compression schemes. This is not a pure runtime isolation — quantisation affects memory footprint and compute patterns.

2. Active parameter count. The MLX model name includes “A4B” — likely 4B active parameters per forward pass in the MoE sparse activation. Ollama reports 25.8B parameters for gemma4:26b. If the MLX model activates fewer experts per token, generation is cheaper per step. This would make gen t/s non-comparable between runtimes. Prefill scaling (how ms/tok grows with context) remains informative regardless.

3. Tokeniser. MLX uses the HuggingFace tokeniser; Ollama embeds the tokeniser in the GGUF. Actual token counts for the same prompt text differ. The bench script reports the MLX tokeniser’s actual count, which ran at ~89% of the target size consistently across all cells. Size labels such as 15K or 40K below are targets; the token-count column gives the measured count.
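
The ~89% figure can be re-derived from the token counts reported in the Phase A table. A minimal sketch, assuming the size labels mean decimal thousands (4K = 4,000), which the post does not state explicitly:

```python
# Target sizes (assumption: "K" means 1,000) and the MLX tokeniser's
# measured counts, copied from the Phase A table.
targets = {"4K": 4_000, "8K": 8_000, "15K": 15_000, "25K": 25_000, "35K": 35_000}
actual = {"4K": 3_555, "8K": 7_112, "15K": 13_333, "25K": 22_223, "35K": 31_112}

# Fraction of the target that the tokeniser actually produced per cell.
ratios = {size: actual[size] / targets[size] for size in targets}

for size, ratio in ratios.items():
    print(f"{size}: {ratio:.1%} of target")
```

The ratio is stable across cells, so comparisons between sizes within MLX are unaffected; comparisons against Ollama at the same label are where the tokeniser confound bites.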

Phase A — Generation Sweep

5 context sizes, 3 reps each, 60s idle between sizes. Ollama daemon stopped before the run. Rep 1 prefill is authoritative — cold KV, no cache reuse.
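
The protocol above could be sketched as a runtime-agnostic harness. This is not the actual bench script: `run_prefill` is a hypothetical callable standing in for whatever pushes a prompt through the model, and it is assumed to start from a cold KV cache and return the prompt token count.

```python
import time
from typing import Callable, Dict, List


def prefill_ms_per_token(run_prefill: Callable[[], int]) -> float:
    """Time one prefill pass; run_prefill (hypothetical) returns the token count."""
    start = time.perf_counter()
    n_tokens = run_prefill()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / n_tokens


def run_sweep(
    cells: Dict[str, Callable[[], int]],
    reps: int = 3,
    idle_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[float]]:
    """Run `reps` timed passes per size, idling `idle_s` seconds between sizes."""
    results: Dict[str, List[float]] = {}
    for index, (label, run_prefill) in enumerate(cells.items()):
        if index > 0:
            sleep(idle_s)  # idle between sizes, not between reps
        results[label] = [prefill_ms_per_token(run_prefill) for _ in range(reps)]
    return results


def rep1(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep only the first rep per size, the one treated as authoritative."""
    return {label: reps[0] for label, reps in results.items()}
```

Passing `sleep` as a parameter is only there so the harness can be exercised without waiting a real minute between sizes.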

Target size	Actual tokens	Rep 1 prefill (ms/tok)	Mean gen (t/s)	Ollama FA=0/q8_0 gen ref (t/s)
4K	3,555	1.546	52.41	42.05
8K	7,112	1.628	47.24	44.07
15K	13,333	1.650	38.71	38.41
25K	22,223	1.964	27.00	32.35
35K	31,112	2.161	26.42	27.90

No cliff. Prefill grows steadily from 1.546 ms/tok at 4K to 2.161 ms/tok at 35K: a 40% rise in per-token cost across a roughly 9× longer prompt, with no abrupt onset.
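
One way to check that growth for a hidden jump is to look at the step-by-step slope of per-token cost between adjacent sizes. A small sketch using only the Phase A numbers; the variable names are illustrative:

```python
# (actual tokens, rep 1 prefill ms/tok) from the Phase A table.
phase_a = [
    (3_555, 1.546),
    (7_112, 1.628),
    (13_333, 1.650),
    (22_223, 1.964),
    (31_112, 2.161),
]

# End-to-end growth in per-token cost versus growth in prompt length.
cost_growth = phase_a[-1][1] / phase_a[0][1]
context_growth = phase_a[-1][0] / phase_a[0][0]

# Change in ms/tok per additional 1,000 prompt tokens, for each adjacent pair.
step_slopes = [
    (ms_b - ms_a) / (tok_b - tok_a) * 1000
    for (tok_a, ms_a), (tok_b, ms_b) in zip(phase_a, phase_a[1:])
]

print(f"per-token cost x{cost_growth:.2f} over context x{context_growth:.2f}")
for slope in step_slopes:
    print(f"{slope:+.4f} ms/tok per 1K tokens")
```

A cliff would show up as one step slope far larger than its neighbours. Here every step stays within the same small range.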

The gen t/s pattern differs from Ollama. At 4K, MLX runs at 52.41 t/s, about 25% faster than Ollama’s 42.05. By 15K the two are level (38.71 vs 38.41), and at 25K and 35K MLX trails (27.00 vs 32.35, 26.42 vs 27.90). This is consistent with the A4B confound: with fewer active parameters per forward pass, autoregressive decoding is cheaper per step at short context. As KV cache overhead grows with context length, that per-step compute advantage shrinks and both runtimes converge near the same memory bandwidth ceiling.
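
The per-size gap is easy to tabulate from the Phase A table. A minimal sketch; note the reference column is the Ollama FA=0/q8_0 run, so this mixes the runtime difference with the confounds listed earlier:

```python
# Mean generation speed in tokens/s, copied from the Phase A table.
mlx = {"4K": 52.41, "8K": 47.24, "15K": 38.71, "25K": 27.00, "35K": 26.42}
ollama_ref = {"4K": 42.05, "8K": 44.07, "15K": 38.41, "25K": 32.35, "35K": 27.90}

# Positive means MLX is faster than the Ollama reference at that size.
gap_pct = {size: (mlx[size] / ollama_ref[size] - 1) * 100 for size in mlx}

for size, gap in gap_pct.items():
    print(f"{size}: {gap:+.1f}%")
```

The sign flips between 15K and 25K, which is the short-context advantage running out rather than a cliff.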

Phase B — Cliff Scan

9 sizes from 20K to 40K in 2.5K steps, 2 reps per size. Cliff threshold: 2× the MLX 15K baseline = 3.300 ms/tok.

Target size	Actual tokens	Rep 1 prefill (ms/tok)	Cliff
20K	17,778	1.859	no
22.5K	20,001	1.961	no
25K	22,223	2.128	no
27.5K	24,445	2.118	no
30K	26,666	2.238	no
32.5K	28,889	2.498	no
35K	31,112	2.464	no
37.5K	33,333	2.518	no
40K	35,555	2.270	no

Peak: 2.518 ms/tok at 37.5K. Threshold: 3.300 ms/tok. Margin: 24%.

No cliff at 32.5K — the exact size where OLLAMA_FLASH_ATTENTION=1 alone triggered a cliff in Exp 010 Condition B. No cliff at 20K — where FA=1 combined with q8_0 pushed Exp 007 past its limit. Smooth growth through 40K.
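
The cliff test itself is a one-line threshold check. A sketch of it applied to the Phase B numbers, assuming "cliff" means rep 1 prefill at or above 2× the MLX 15K baseline:

```python
BASELINE_15K = 1.650            # MLX rep 1 prefill at 15K, ms/tok
THRESHOLD = 2 * BASELINE_15K    # cliff threshold, ms/tok

# Rep 1 prefill ms/tok per target size, copied from the Phase B table.
scan = {
    "20K": 1.859, "22.5K": 1.961, "25K": 2.128, "27.5K": 2.118, "30K": 2.238,
    "32.5K": 2.498, "35K": 2.464, "37.5K": 2.518, "40K": 2.270,
}

# Sizes that breach the threshold (assumption: ">=" counts as a cliff).
cliffs = [size for size, ms in scan.items() if ms >= THRESHOLD]

# Worst cell and its headroom, expressed as a fraction of the threshold.
peak_size, peak = max(scan.items(), key=lambda item: item[1])
margin = (THRESHOLD - peak) / THRESHOLD

print(f"cliffs: {cliffs}")
print(f"peak {peak} ms/tok at {peak_size}, margin {margin:.1%}")
```

The margin here is measured against the threshold, which is how the 24% above is computed.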

Hypothesis Verdicts

H1 — CONFIRMED. MLX 15K prefill (1.650 ms/tok) is below the Ollama FA=0/fp16 baseline (1.774 ms/tok). MLX’s unified-memory-native kernels match or marginally outperform the best Ollama configuration.

H2 — CONFIRMED. No prefill cliff through 40K tokens. The FA=1 cliff is absent from MLX entirely.

H3 — CONFIRMED. MLX 25K gen t/s (27.00) falls within the pre-registered ±25% window around Ollama FA=0/fp16 (31.08). The wide tolerance was set because the A4B confound was expected to make gen t/s non-comparable. That is what happened: a clear MLX lead at short context, then convergence by 35K.
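
H1 and H3 reduce to two numeric checks. A minimal sketch, assuming H1 is read as "MLX at or below the Ollama FA=0/fp16 baseline" and H3 as a symmetric relative window; the exact pre-registered wording is in the log, not reproduced here:

```python
def within_tolerance(observed: float, reference: float, tolerance: float = 0.25) -> bool:
    """True if observed lies within +/- tolerance (relative) of reference."""
    return abs(observed - reference) <= tolerance * reference


# H1: 15K prefill in ms/tok, MLX vs Ollama FA=0/fp16 (assumed "at or below").
h1 = 1.650 <= 1.774

# H3: 25K generation in tokens/s, MLX vs Ollama FA=0/fp16.
h3 = within_tolerance(27.00, 31.08)
deviation = (27.00 - 31.08) / 31.08

print(f"H1: {h1}")
print(f"H3: {h3} (deviation {deviation:+.1%})")
```

The observed deviation sits at roughly half the allowed window, so H3 would still pass under a tighter tolerance.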

What Two Runtimes Now Confirm

Runtime	Config	Cliff?	15K prefill ms/tok
Ollama	FA=1, q8_0	yes (20K)	25.08
Ollama	FA=1, fp16	yes (32.5K)	5.405
Ollama	FA=0, fp16	no	1.774
Ollama	FA=0, q8_0	no	1.694
MLX	default	no	1.650

The cliff appears in every configuration that touches Flash Attention in llama.cpp. It is absent from every configuration that does not — including an entirely independent runtime.

This is the cleanest possible causal isolation short of inspecting the kernel code: vary the runtime, hold the hardware constant, observe whether the cliff follows the flag or the chip. It follows the flag.
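
The "follows the flag" claim can be checked mechanically against the table. In this sketch the MLX row is encoded as not using llama.cpp's Flash Attention path, which is an assumption based on MLX being a separate runtime, not a statement about MLX's internal kernels:

```python
from typing import NamedTuple


class Row(NamedTuple):
    runtime: str
    config: str
    llama_cpp_fa: bool   # llama.cpp Flash Attention path enabled
    cliff: bool
    prefill_15k: float   # ms/tok


rows = [
    Row("Ollama", "FA=1, q8_0", True, True, 25.08),
    Row("Ollama", "FA=1, fp16", True, True, 5.405),
    Row("Ollama", "FA=0, fp16", False, False, 1.774),
    Row("Ollama", "FA=0, q8_0", False, False, 1.694),
    Row("MLX", "default", False, False, 1.650),  # assumption: no llama.cpp FA
]

# Does the cliff column match the flag column in every row?
follows_flag = all(row.cliff == row.llama_cpp_fa for row in rows)

# Slowest no-FA cell versus fastest FA cell at 15K.
worst_without_fa = max(row.prefill_15k for row in rows if not row.llama_cpp_fa)
best_with_fa = min(row.prefill_15k for row in rows if row.llama_cpp_fa)

print(f"cliff follows flag: {follows_flag}")
print(f"15K prefill: worst without FA {worst_without_fa}, best with FA {best_with_fa}")
```

Even the slowest configuration without the flag is about three times cheaper at 15K than the fastest one with it.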

The Implication

The Mac Mini M4 Pro does not have an architectural attention scaling problem. It has a 273 GB/s memory bus that Flash Attention’s tiling overhead fights against on unified memory — because the tile pass was designed for discrete GPU SRAM/HBM hierarchies that Apple Silicon does not have.

Remove the flag, or use a runtime that doesn’t implement FA the same way, and the hardware behaves as the memory bandwidth numbers predict: roughly linear prefill growth, no cliff through 40K tokens.

The ceiling for long-context inference on this machine is memory bandwidth. That is a fixed physical constraint. Everything else — runtime choice, flag configuration, quantisation format — is an engineering decision. And engineering decisions can be changed.

Evidence: exp_011_mlx_runtime/ · scientific_log.md

Tags: Local-Llm, Benchmark, Chronos, MLX, Ollama, Apple-Silicon, Gemma4
