MLX | Local First AI

TL;DR: MLX does not cliff through 40K tokens on a Mac Mini M4 Pro. MLX prefill at 15K: 1.650 ms/tok. Ollama with Flash Attention off (FA=0) at 15K: 1.774 ms/tok. Difference: about 7%. Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not the attention kernel. The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model.

Saw someone running gemma4:26b-mlx directly: not through Ollama, but natively on the MLX runtime. Left a reply: we hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend? ...
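For readers who want to run the same kind of check on their own numbers, here is a minimal sketch of how a "cliff" could be detected from prefill timings. It is not the harness used in the experiments. The `PrefillSample` type, the `find_cliff` helper, the 2x jump threshold, and every timing in the two sample runs are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefillSample:
    context_tokens: int     # prompt length in tokens
    prefill_seconds: float  # wall-clock time spent processing the prompt

    @property
    def ms_per_token(self) -> float:
        # Per-token prefill cost in milliseconds.
        return self.prefill_seconds * 1000.0 / self.context_tokens


def find_cliff(samples: List[PrefillSample], max_ratio: float = 2.0) -> Optional[int]:
    """Return the first context length where ms/tok jumps by more than
    max_ratio over the previous (shorter) context, or None if it never does.
    The 2.0 default is an arbitrary illustrative threshold."""
    ordered = sorted(samples, key=lambda s: s.context_tokens)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.ms_per_token > max_ratio * prev.ms_per_token:
            return cur.context_tokens
    return None


# Placeholder timings, NOT measured results.
flat_run = [
    PrefillSample(5_000, 7.5),    # 1.5 ms/tok
    PrefillSample(15_000, 24.0),  # 1.6 ms/tok
    PrefillSample(40_000, 72.0),  # 1.8 ms/tok
]
cliff_run = [
    PrefillSample(5_000, 7.5),     # 1.5 ms/tok
    PrefillSample(15_000, 24.0),   # 1.6 ms/tok
    PrefillSample(40_000, 400.0),  # 10.0 ms/tok
]

if __name__ == "__main__":
    print(find_cliff(flat_run))   # None
    print(find_cliff(cliff_run))  # 40000
```

A gentle rise in ms/tok with context length is what a bandwidth-bound runtime would be expected to show. A sudden multiple-fold jump between adjacent context sizes is the pattern described here as a cliff.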