TL;DR — Skip to the tables if you’re in a hurry

  • The 20K cliff was not a hardware limit. It was OLLAMA_FLASH_ATTENTION=1.
    • Remove the flag: no cliff through 40K tokens on Mac Mini M4 Pro.
    • Keep the flag alone (no q8_0): cliff at 32.5K, prefill 3× worse at 15K.
    • Add q8_0 to FA=1: cliff drops to 20K — Exp 007’s original number.
  • q8_0 alone is benign. Actually marginally better.
    • FA=0 + q8_0: no cliff, roughly 4–5% better gen t/s than fp16 from 15K up, smaller KV memory footprint.
    • This is now the production configuration.
  • The Mac Mini’s true ceiling is >40K tokens on-wire.
    • Every cascade design decision made since Incident 003-Alpha can be revisited.
  • Flash Attention was designed for SRAM/HBM hierarchies. Apple Silicon doesn’t have one.

Every architectural decision this project has made about context size rests on a single measurement from March 2026: the Mac Mini M4 Pro hits a prefill cliff at ~22K tokens. Past that point, prefill latency goes super-quadratic. At 35K tokens, a single model call takes 20 minutes.

That measurement was real. The data was clean. The experiment was reproducible.

The cause of the cliff was not what we thought.


The Configuration That Wasn’t Checked

Experiment 007 (May 29, 2026) pre-registered its hardware comparison against a clear baseline: OLLAMA_FLASH_ATTENTION=0, default fp16 KV cache. The pre-registration is committed and timestamped. The runtime config was not verified against it before the runs started.

The Ollama LaunchAgent plist had been edited roughly a month before Exp 007 ran, adding OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0. The intent was performance optimisation. The effect on an experiment that hadn’t been designed yet was not considered, because experiments that haven’t been designed yet have no requirements.
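A quick way to see which flags launchd will hand the daemon at its next start is to read the EnvironmentVariables dictionary out of the plist. The following is a minimal sketch, assuming the plist file is readable by whoever runs the check; the function name is hypothetical. It shows only what is on disk, not what an already-running daemon was started with.

```python
import plistlib


def ollama_env_from_plist(plist_bytes):
    """Return the OLLAMA_* entries from a LaunchAgent plist's EnvironmentVariables.

    plist_bytes is the raw file content (XML or binary plist).
    """
    plist = plistlib.loads(plist_bytes)
    env = plist.get("EnvironmentVariables", {})
    # Keep only the variables that configure the Ollama server.
    return {key: value for key, value in env.items() if key.startswith("OLLAMA_")}
```

Called on the file's bytes, this would have surfaced both OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE as soon as the edit landed.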

miktam02 cannot read another user’s process environment on macOS. The Ollama daemon runs under miktam. ps ewww on that process returns nothing across the user boundary — macOS enforces this. The standard check failed silently, returning a clean negative that meant nothing.

The right source of truth was the Ollama log, which records the full server config at every startup. When Experiment 008 was scoped and about to start, the log check found this entry from May 29:

2026-05-29T14:19:50+02:00 level=INFO source=gpu.go msg="server config"
OLLAMA_FLASH_ATTENTION:true FlashAttention:Enabled KvCacheType:q8_0

The Exp 007 benchmark runs started at 14:23 CEST. That log line is from 14:19 — the daemon startup four minutes before. Every measurement in Exp 007 ran with FA=1 and q8_0 active. The pre-registered FA=0 baseline never existed.

This is Incident 007-Alpha: a protocol deviation that altered the experiment outcome. The incident is documented in the Chronos scientific log, the affected experiments are annotated, and log-based config verification before every bench run is now a mandatory protocol step, not an optional sanity check.
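The log-based check can be automated along these lines. This is a sketch under stated assumptions: each "server config" entry sits on a single log line in the key:value form excerpted above, and the function names are hypothetical. The caller supplies the log text; the expected values come from the pre-registration.

```python
import re

MARKER = 'msg="server config"'
# key:value tokens such as OLLAMA_FLASH_ATTENTION:true (format assumed from the excerpt)
PAIR = re.compile(r'([A-Za-z_]\w*):([^\s\]"]+)')


def latest_server_config(log_text):
    """Return key/value pairs from the most recent 'server config' entry, or None."""
    config = None
    for line in log_text.splitlines():
        if MARKER in line:
            # Parse only what follows the marker so the timestamp is never matched.
            tail = line.split(MARKER, 1)[1]
            config = dict(PAIR.findall(tail))
    return config


def config_mismatches(actual, expected):
    """Map each deviating key to (expected, actual); empty dict means the config matches."""
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
```

A bench runner would refuse to start unless config_mismatches returns an empty dict for the most recent startup entry.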


Experiment 008: What the Baseline Actually Is

The original Exp 008 hypothesis — “FA+q8_0 can push the cliff beyond 30K” — collapsed before the first measurement ran. You cannot test whether flags help when the baseline was already running those flags. The experiment was reformulated: establish FA=0/fp16 performance on the Mini and compare it against the known FA=1+q8_0 result from Exp 007.

The daemon was updated, the config verified via the log before running anything, and the bench started.

Exp 008 — FA=0, fp16 KV, Mac Mini M4 Pro:

Context   Prefill ms/tok   Gen t/s   Cliff
4K        1.497            42.05
8K        1.570            44.07
15K       1.774            36.97
25K       1.869            31.08     no
35K       2.241            26.69     no
40K       2.215                      no

No cliff through 40K tokens. At 35K, gen t/s was 26.69. Exp 007’s number at 35K — the one that shaped the cascade ceiling — was 10.75 t/s. A 2.5× difference. Same hardware, same model weights, same quantisation. Different config.

The 20K cliff was a property of OLLAMA_FLASH_ATTENTION=1, not of the hardware.


Experiment 010: Isolating the Culprit

Knowing the combination (FA=1 + q8_0) caused the cliff does not identify which component caused it. Two independent variables, one cliff. The clean answer requires running the missing two cells of the 2×2 factorial.

Conditions A and D were already measured. Conditions B and C were the unknowns:

        fp16 (default)             q8_0
FA=0    A: Exp 008 ✓ no cliff      C: Exp 010
FA=1    B: Exp 010                 D: Exp 007 ✓ cliff at 20K

Condition B (FA=1, fp16): flash attention alone, without KV quantisation.
Condition C (FA=0, q8_0): KV quantisation alone, without flash attention.

Each condition ran Phase A (5 context sizes, 3 reps each, 4K–35K) and Phase B (9 sizes, 2 reps each, 20K–40K). Fresh model unload and 60-second idle between each size cell to prevent KV cache reuse from contaminating prefill measurements.
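The schedule described above can be sketched as a small driver. Several things here are assumptions rather than details from the experiment: the size lists (Phase A taken from the sizes reported in the tables, Phase B assumed evenly spaced at 2.5K), the reading of "4K" as 4,000 tokens, and the measure and unload callables, which stand in for whatever the real bench uses.

```python
import time

# Assumed size lists, in tokens; only the counts and ranges come from the text.
PHASE_A = {"sizes": [4_000, 8_000, 15_000, 25_000, 35_000], "reps": 3}
PHASE_B = {"sizes": [20_000 + 2_500 * i for i in range(9)], "reps": 2}


def run_phase(sizes, reps, measure, unload, idle_s=60, sleep=time.sleep):
    """Run `reps` measurements per context size, isolating each size cell.

    measure(size) -> one measurement; unload() evicts the model.
    Both are hypothetical callables supplied by the caller.
    """
    results = {}
    for size in sizes:
        # Fresh unload plus idle before every size cell, so no KV cache
        # from the previous cell can shorten this cell's prefill.
        unload()
        sleep(idle_s)
        results[size] = [measure(size) for _ in range(reps)]
    return results
```

Passing sleep as a parameter keeps the 60-second idle in real runs while letting a dry run substitute a no-op.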

Condition B — FA=1, fp16 KV:

Context   Prefill ms/tok   Gen t/s   Cliff
4K        1.583            37.80
8K        2.013            34.19
15K       5.405            28.80     (baseline)
25K       8.432            21.66     yes
32.5K     11.247           17.53     yes
35K       12.286           16.26     yes
40K       13.793                     yes

Cliff onset: 32.5K tokens. FA=1 alone causes a cliff — and at 15K, before any cliff, it already more than triples prefill latency (1.774 → 5.405 ms/tok) relative to the FA=0 baseline. Adding q8_0 to FA=1 (Exp 007, Condition D) moves the cliff onset down from 32.5K to 20K. The two flags compound.

Condition C — FA=0, q8_0:

Context   Prefill ms/tok   Gen t/s   Cliff
4K        1.497            42.05
8K        1.570            44.07
15K       1.694            38.41     (baseline)
25K       1.859            32.35     no
32.5K     2.050            25.83     no
35K       2.110            27.90     no
40K       2.213            26.39     no

No cliff through 40K tokens. Gen t/s at 35K: 27.90, marginally better than the FA=0/fp16 baseline (26.69). q8_0 alone does not degrade performance. The reduced KV memory footprint appears to free bandwidth during autoregressive decoding: at 15K, 25K, and 35K, gen t/s is roughly 4–5% higher than under fp16 (38.41 vs 36.97, 32.35 vs 31.08, 27.90 vs 26.69).
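The per-size Cliff flags in the Condition B and C tables can be read as a comparison against the 15K baseline row. The criterion actually used is not stated here, so the sketch below assumes one: a size is flagged when its prefill ms/tok reaches 1.5× the 15K value. That threshold is an illustrative assumption, chosen only because it reproduces the yes/no column of both tables.

```python
CLIFF_RATIO = 1.5  # assumed threshold, not taken from the experiment protocol


def cliff_flags(prefill_ms_per_tok, baseline_ctx="15K", ratio=CLIFF_RATIO):
    """Flag each context size whose prefill ms/tok is >= ratio x the baseline row."""
    base = prefill_ms_per_tok[baseline_ctx]
    return {
        ctx: ms / base >= ratio
        for ctx, ms in prefill_ms_per_tok.items()
        if ctx != baseline_ctx
    }


# Prefill ms/tok from the two Exp 010 tables, baseline row and above.
condition_b = {"15K": 5.405, "25K": 8.432, "32.5K": 11.247, "35K": 12.286, "40K": 13.793}
condition_c = {"15K": 1.694, "25K": 1.859, "32.5K": 2.050, "35K": 2.110, "40K": 2.213}

print(cliff_flags(condition_b))
print(cliff_flags(condition_c))
```

Under this assumed threshold every Condition B size past the baseline is flagged and no Condition C size is.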

The complete picture:

Condition     Config        Cliff onset   35K gen t/s   15K prefill ms/tok
A (Exp 008)   FA=0, fp16    > 40K         26.69         1.774
B (Exp 010)   FA=1, fp16    32.5K         16.26         5.405
C (Exp 010)   FA=0, q8_0    > 40K         27.90         1.694
D (Exp 007)   FA=1, q8_0    20K           10.75         25.08

FA=1 is the necessary and sufficient cause. q8_0 is benign in isolation and harmful only when paired with FA=1, where it shifts the cliff onset down a further 12.5K tokens.
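The "benign alone, harmful when paired" reading is a statement about simple effects in the 2×2 design, and it can be checked directly against the 35K gen t/s column. This is a worked sketch using only the four values from the table above; the function names are illustrative.

```python
# 35K gen t/s for the four cells of the 2x2 factorial (from the summary table).
GEN_TPS_35K = {
    ("FA=0", "fp16"): 26.69,  # A
    ("FA=1", "fp16"): 16.26,  # B
    ("FA=0", "q8_0"): 27.90,  # C
    ("FA=1", "q8_0"): 10.75,  # D
}


def q8_effect(fa):
    """Relative change in gen t/s from fp16 to q8_0 at a fixed FA setting."""
    fp16 = GEN_TPS_35K[(fa, "fp16")]
    return (GEN_TPS_35K[(fa, "q8_0")] - fp16) / fp16


def fa_effect(kv):
    """Relative change in gen t/s from FA=0 to FA=1 at a fixed KV cache type."""
    off = GEN_TPS_35K[("FA=0", kv)]
    return (GEN_TPS_35K[("FA=1", kv)] - off) / off


def interaction():
    """Difference of differences in t/s: (D - B) - (C - A)."""
    g = GEN_TPS_35K
    return (g[("FA=1", "q8_0")] - g[("FA=1", "fp16")]) - (
        g[("FA=0", "q8_0")] - g[("FA=0", "fp16")]
    )


for fa in ("FA=0", "FA=1"):
    print(f"q8_0 effect at {fa}: {q8_effect(fa):+.1%}")
for kv in ("fp16", "q8_0"):
    print(f"FA=1 effect at {kv}: {fa_effect(kv):+.1%}")
print(f"interaction: {interaction():+.2f} t/s")
```

q8_0 moves gen t/s by about +4.5% with FA off and about -34% with FA on, while FA=1 costs roughly 39% under fp16 and 61% under q8_0. The sign flip in the q8_0 effect is the interaction.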


Why Flash Attention Hurts on Apple Silicon

Flash attention was designed to solve a specific problem on discrete GPU architectures: reducing HBM traffic during attention computation by keeping KV blocks in SRAM and computing attention in tiles. On an A100 or H100, HBM bandwidth is the binding constraint and SRAM is fast but scarce. The tile pass trades sequential memory reads for bounded on-chip computation — a favourable exchange when off-chip bandwidth is the bottleneck.
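The tile pass can be illustrated for a single query vector. The sketch below is pure Python and is not the Metal kernel that Ollama or llama.cpp runs; it only shows the online-softmax bookkeeping that lets attention be computed one KV block at a time, and the per-block rescaling that tiling adds. The function names and the block size are illustrative.

```python
import math


def _score(q, k):
    """Scaled dot product between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def attention_reference(q, keys, values):
    """Plain softmax attention: materialises every score before normalising."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(q, keys, values, block=2):
    """Same result, visiting KV in blocks and keeping only running statistics."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        # This extra work per block is the overhead tiling introduces.
        scale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point error. The tiled version never holds more than one block of scores, which is the property that pays off when fast on-chip memory is scarce.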

Apple Silicon has no SRAM/HBM hierarchy. The GPU, CPU, Neural Engine, and memory controller share a single unified DRAM pool. The M4 Pro’s memory bandwidth is substantial — 273 GB/s — but it is flat. There is no on-chip fast memory to hide round-trips to. Flash attention’s tiling overhead — additional synchronisation, block-boundary computations, non-sequential access patterns — applies in full on unified memory without the bandwidth benefit it was designed to capture. On this architecture, sequential KV access is not the bottleneck. FA turns a non-bottleneck into a compute and synchronisation problem that is strictly worse.

The MoE architecture of gemma4:26b amplifies this. MoE routing is inherently non-sequential: each token activates a sparse subset of experts, and the weight access pattern is irregular across the expert matrix at every forward pass. Flash attention’s tiled block computation assumes and optimises for sequential attention patterns. Layered on top of an already-irregular MoE dispatch, the conflict compounds at every context size — visible in the 3× prefill overhead at 15K tokens, before any cliff, whenever FA=1 is active.

This is not a criticism of flash attention as a technique. On discrete GPU hardware it is the right choice. On Apple Silicon unified memory running a MoE model, it is the wrong flag for the wrong architecture.


The Implication

The Mac Mini M4 Pro’s operational ceiling under optimal configuration (FA=0, q8_0) is greater than 40K tokens on-wire. Both FA=0 cells of the pre-registered 2×2 factorial (Conditions A and C) ran to 40K without a cliff; both FA=1 cells hit one.

Every cascade design decision made since Incident 003-Alpha rested on a measured ceiling of 18–22K tokens. That ceiling was real in the sense that the cliff was reproducible and the data was clean. The assumption embedded in the measurement — that the daemon was running at default flags — was never verified. It was a configuration property that was diagnosed as a hardware property.

The 4K-per-call watcher budget is now a quality constraint, not a hardware constraint. The 22K bundle ceiling is not an architectural floor — it is a conservative choice made under different information. For workloads that need more context per call, the full 40K range is accessible on the Mini without hitting a cliff.

The production daemon has been updated: FA=0, q8_0. This is the optimal configuration per Condition C: no cliff, marginally better throughput, reduced KV memory footprint. Of the two flags that were never meant to be running during Exp 007, one turns out to belong in production.

The right question before concluding that hardware cannot handle something: what is in the daemon config, and when was it last verified against what the experiment expects?


Evidence: exp_008_flash_attention/ · exp_010_fa_isolation/ · scientific_log.md
