Apple-Silicon

Same Hardware. Different Runtime. Same Result.

TL;DR MLX does not cliff through 40K tokens on Mac Mini M4 Pro. MLX prefill at 15K: 1.650 ms/tok. Ollama FA=0 at 15K: 1.774 ms/tok. Difference: 3%. Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not attention kernel. The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model. Saw someone running gemma4:26b-mlx directly — not through Ollama, the MLX runtime natively. Left a reply: we hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend? ...

The Cliff That Wasn't

TL;DR — Skip to the tables if you’re in a hurry The 20K cliff was not a hardware limit. It was OLLAMA_FLASH_ATTENTION=1. Remove the flag: no cliff through 40K tokens on Mac Mini M4 Pro. Keep the flag alone (no q8_0): cliff at 32.5K, prefill 3× worse at 15K. Add q8_0 to FA=1: cliff drops to 20K — Exp 007’s original number. q8_0 alone is benign. Actually marginally better. FA=0 + q8_0: no cliff, +5% gen t/s vs fp16, smaller KV memory footprint. This is now the production configuration. The Mac Mini’s true ceiling is >40K tokens on-wire. Every cascade design decision made since Incident 003-Alpha can be revisited. Flash Attention was designed for SRAM/HBM hierarchies. Apple Silicon doesn’t have one. Every architectural decision this project has made about context size rests on a single measurement from March 2026: the Mac Mini M4 Pro hits a prefill cliff at ~22K tokens. Past that point, prefill latency goes super-quadratic. At 35K tokens, a single model call takes 20 minutes. ...

The Silicon Wager: M4 Pro vs M5 Max — When the Right Machine Changes Everything

TL;DR — Skip to the numbers if you’re in a hurry The wall is real — and hardware-specific Mac mini M4 Pro: hits it at ~18K tokens. Past that, processing a single input can take 20 minutes. MacBook Pro M5 Max: doesn’t hit it until ~45K tokens — 2.5× further. The speed gap is large At 25K tokens: MBP generates output 4.7× faster than the Mini. MBP at 35K tokens is still faster to process than the Mini at 4K tokens. The wall is a memory bandwidth limit, not a bug Mini: a sharp wall — cross it and performance collapses. MBP: a gentle ramp — performance degrades slowly above the limit. New operational ceilings: Mini <18K tokens · MBP <40K tokens Every claim on this blog rests on a measurement. And until today, every measurement rested on one machine: the Mac mini M4 Pro. ...

Should We Stop Asking Local LLMs to Think?

What Adam Smith, neuroscience, and a melting Mac Mini taught me about the real division of cognitive labour. My Mac Mini was dying. Not dramatically — no smoke, no kernel panic. Just a quiet, 24-minute seizure: the fan screaming, and my Telegram bot silently refusing to answer “hello.” I’m Miktam, a software engineer who’s spent the last few months building a local AI assistant on a Mac Mini instead of paying cloud APIs to think for me. ...