Flash-Attention

TL;DR — Skip to the tables if you’re in a hurry The 20K cliff was not a hardware limit. It was OLLAMA_FLASH_ATTENTION=1. Remove the flag: no cliff through 40K tokens on Mac Mini M4 Pro. Keep the flag alone (no q8_0): cliff at 32.5K, prefill 3× worse at 15K. Add q8_0 to FA=1: cliff drops to 20K — Exp 007’s original number. q8_0 alone is benign. Actually marginally better. FA=0 + q8_0: no cliff, +5% gen t/s vs fp16, smaller KV memory footprint. This is now the production configuration. The Mac Mini’s true ceiling is >40K tokens on-wire. Every cascade design decision made since Incident 003-Alpha can be revisited. Flash Attention was designed for SRAM/HBM hierarchies. Apple Silicon doesn’t have one. Every architectural decision this project has made about context size rests on a single measurement from March 2026: the Mac Mini M4 Pro hits a prefill cliff at ~22K tokens. Past that point, prefill latency goes super-quadratic. At 35K tokens, a single model call takes 20 minutes. ...