TL;DR — Skip to the numbers if you’re in a hurry

  • The wall is real — and hardware-specific
    • Mac mini M4 Pro: hits it at ~18K tokens. Well past that, at 35K tokens, processing a single input takes nearly 20 minutes.
    • MacBook Pro M5 Max: doesn’t hit it until ~45K tokens — 2.5× further.
  • The speed gap is large
    • At 25K tokens: MBP generates output 4.6× faster than the Mini.
    • MBP at 35K tokens still has a lower per-token prefill cost (1.243 ms/tok) than the Mini at 4K tokens (3.033 ms/tok).
  • The wall is a memory bandwidth limit, not a bug
    • Mini: a sharp wall — cross it and performance collapses.
    • MBP: a gentle ramp — performance degrades slowly above the limit.
  • New operational ceilings: Mini <18K tokens · MBP <40K tokens

Every claim on this blog rests on a measurement. And until today, every measurement rested on one machine: the Mac mini M4 Pro.

The 22K token ceiling in ADR-002. The prefill cliff between 25K and 35K tokens from Incident 003. The ~41 t/s generation baseline. The Router/Reducer cascade’s safe operating window. All of it was measured on a single chip, in a single thermal environment, at a single point in time — and treated as architectural truth.

When a MacBook Pro M5 Max entered the environment, the honest question became: are these constants? Or are they properties of one machine?

Experiment 007 is the answer.


The Wager

We pre-registered four hypotheses before running a single benchmark.

H1 (primary): The MacBook Pro M5 Max yields materially higher generation throughput and/or a higher prefill cliff threshold, due to increased memory bandwidth in the Max die.

H2 (throughput): Gen t/s at medium context improves by ≥10% on the MBP.

H3 (adversarial): Sustained laptop thermals degrade MBP performance relative to the Mini’s whisper-quiet, desktop-class thermal steady state.

H4 (cascade): The Router/Reducer cascade produces consistent, grounded answers on both machines — only latency differs.

The pre-registered falsification criterion for H1: rejected if MBP gen t/s falls within ±5% of the Mini’s baseline across ≥4 of 5 context sizes.

Spoiler: H1 was not rejected.
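
The falsification rule above is mechanical enough to write down. The following is a minimal sketch, not the experiment's actual tooling: the function name and data layout are assumptions made for illustration, and the throughput figures are copied from the Phase A table reported later in this post (the 4K MBP value is the approximate steady-state figure).

```python
# Phase A generation throughput in t/s, keyed by context size in K tokens.
# Values copied from the Phase A results table; 92.0 stands in for "~92".
MINI_GEN_TPS = {4: 34.76, 8: 31.38, 15: 25.08, 25: 14.40, 35: 10.75}
MBP_GEN_TPS = {4: 92.0, 8: 84.01, 15: 75.52, 25: 66.44, 35: 57.85}


def h1_rejected(baseline, candidate, tolerance=0.05, min_sizes=4):
    """Return True if candidate sits within +/- tolerance of baseline
    at min_sizes or more context sizes (the pre-registered rejection rule)."""
    within = [
        size
        for size in baseline
        if abs(candidate[size] - baseline[size]) <= tolerance * baseline[size]
    ]
    return len(within) >= min_sizes


print(h1_rejected(MINI_GEN_TPS, MBP_GEN_TPS))
```

Comparing the Mini against itself is a quick sanity check: every size falls inside the band, so the rule would reject.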


The Setup

The protocol was identical on both machines. Same model weights (gemma4:26b, Q4_K_M). Same Ollama version (0.20.2). Same environment variables. Same fixtures — synthetic English-text padding prompts at calibrated token counts, committed before the first run and never modified. AC power throughout.

Phase A (generation sweep): Five context sizes — 4K, 8K, 15K, 25K, 35K tokens. Three repeats per size, with a fresh model unload and 60-second idle between each size cell. We capture prompt_eval_duration and eval_duration from Ollama’s response to derive prefill ms/tok and gen t/s. Only rep 1 is authoritative for prefill — reps 2 and 3 reuse the KV cache on an identical prompt and return near-zero prefill times.
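
For readers who want to reproduce the two derived metrics, here is a minimal sketch of the arithmetic. It assumes the token counts are taken from the prompt_eval_count and eval_count fields of the same Ollama response, and that durations are reported in nanoseconds, as Ollama's API does. The function name and the example values are illustrative, not measurements from this experiment.

```python
NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000


def derive_metrics(response: dict) -> dict:
    """Derive prefill ms/tok and generation t/s from one Ollama response dict."""
    prefill_ms = response["prompt_eval_duration"] / NS_PER_MS
    gen_seconds = response["eval_duration"] / NS_PER_S
    return {
        "prefill_ms_per_tok": prefill_ms / response["prompt_eval_count"],
        "gen_tps": response["eval_count"] / gen_seconds,
    }


# Made-up response values, chosen only to keep the arithmetic easy to follow.
example = {
    "prompt_eval_count": 4000,
    "prompt_eval_duration": 12_000_000_000,  # 12 s of prefill
    "eval_count": 200,
    "eval_duration": 5_000_000_000,  # 5 s of generation
}
metrics = derive_metrics(example)
print(metrics)
```

Only the first repeat of each cell should be fed through the prefill half of this calculation, for the KV-cache reason given above.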

Phase B (cliff localisation): A sweep from 20K to 120K tokens to bracket the point where prefill ms/tok exceeds 2× the 15K baseline. For the Mini, we expected the cliff around 25–35K (Incident 003 baseline). For the MBP, we didn’t know, so we extended the fixtures to 120K when Phase A showed no sign of a cliff at 35K.


Phase A: Generation Throughput

Tokens | Mini rep 1 prefill (ms/tok) | MBP rep 1 prefill (ms/tok) | Mini gen (t/s) | MBP gen (t/s)
~4K | 3.033 | 0.801 | 34.76 | ~92*
~8K | 4.870 | 0.777 | 31.38 | 84.01
~15K | 8.316 | 0.884 | 25.08 | 75.52
~25K | 24.154 | 1.057 | 14.40 | 66.44
~35K | 33.752 | 1.243 | 10.75 | 57.85

* MBP 4K rep 1 includes cold model load; steady-state measured in reps 2–3 at ~92 t/s.

The headline number is buried in the prefill column. At 35K tokens, the MacBook Pro M5 Max has a lower prefill cost (1.243 ms/tok) than the Mac mini M4 Pro at 4K tokens (3.033 ms/tok).

The Mini at 35K takes 1,190 seconds — nearly 20 minutes — just to process the input. The MBP at the same size takes 48 seconds. Same model. Same weights. Same quantisation. Different die.

Generation throughput tells the same story. At 25K tokens, the Mini generates at 14.4 t/s. The MBP generates at 66.4 t/s, a 4.6× difference. This is not an incremental improvement. It is a different machine class for this workload.


Phase B: Cliff Localisation

The cliff is defined as the point where prefill ms/tok exceeds 2× the 15K baseline. For the Mini: 2 × 8.316 = 16.632 ms/tok. For the MBP: 2 × 0.884 = 1.768 ms/tok.

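Tokens | Mini prefill (ms/tok) | MBP prefill (ms/tok) | Mini cliff | MBP cliff
~20K | 19.352 | 0.965 | YES | no
~25K | 24.235 | 1.065 | YES | no
~30K | 28.945 | 1.143 | YES | no
~35K | 33.767 | 1.329 | YES | no
~40K | 38.493 | 1.381 | YES | no
~50K | - | 1.929 | - | YES
~80K | - | 2.819 | - | YES
~120K | - | 3.826 | - | YES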

The Mini’s cliff onset is between 15K and 20K tokens: the first Phase B point already exceeds the threshold. The MBP’s cliff onset is between 40K and 50K tokens. Using the approximate onsets of ~18K and ~45K tokens, the MBP handles about 2.5× as much context before hitting the bandwidth wall.
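
The bracketing logic can be sketched in a few lines. This is an illustration of the 2× rule as defined above, not the experiment's own script; the function name is hypothetical. The 15K points come from the Phase A table and the rest from the Phase B table.

```python
def cliff_bracket(points, baseline_ms_per_tok, factor=2.0):
    """Return (last size below threshold, first size above it), sizes in K tokens.

    points is a list of (tokens_k, prefill_ms_per_tok) pairs.
    The second element is None if no measured point exceeds the threshold.
    """
    threshold = factor * baseline_ms_per_tok
    last_below = None
    for tokens_k, ms_per_tok in sorted(points):
        if ms_per_tok > threshold:
            return last_below, tokens_k
        last_below = tokens_k
    return last_below, None


MINI = [(15, 8.316), (20, 19.352), (25, 24.235), (30, 28.945),
        (35, 33.767), (40, 38.493)]
MBP = [(15, 0.884), (20, 0.965), (25, 1.065), (30, 1.143), (35, 1.329),
       (40, 1.381), (50, 1.929), (80, 2.819), (120, 3.826)]

print(cliff_bracket(MINI, baseline_ms_per_tok=8.316))  # (15, 20)
print(cliff_bracket(MBP, baseline_ms_per_tok=0.884))   # (40, 50)
```

The result is only as tight as the fixture spacing: narrowing either bracket would need additional fixtures between the two sizes.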

At 40K tokens, the Mini prefill takes 26 minutes. At 40K tokens, the MBP prefill takes 59 seconds.


An Unexpected Finding: The Shape of the Cliff

Incident 003-Alpha described the Mini’s prefill behaviour past ~25K tokens as “super-quadratic.” Exp 007 refines that characterisation.

What actually happens is a step function followed by a linear regime. The jump from 15K to 20K tokens is steep (8.316 → 19.352 ms/tok, the cliff onset). But above 20K, the Mini’s prefill grows at a consistent ~0.95 ms/tok per 1K tokens — linear, not super-quadratic.

The MBP shows a similar structure, but with a much flatter slope inside the cliff: ~0.03 ms/tok per 1K tokens above 50K. That is roughly 30× flatter than the Mini. The MBP cliff is a gentle ramp; the Mini cliff is a sharp wall.
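
The two slopes can be checked against the Phase B table with a simple endpoint calculation. This sketch assumes a straight line between the first and last post-cliff points on each machine (20K to 40K on the Mini, 50K to 120K on the MBP); a least-squares fit would be a reasonable alternative. The function name is illustrative.

```python
def slope_per_1k(points):
    """Endpoint slope in ms/tok per 1K tokens from (tokens_k, ms_per_tok) pairs."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


# Endpoints taken from the Phase B table.
mini_slope = slope_per_1k([(20, 19.352), (40, 38.493)])
mbp_slope = slope_per_1k([(50, 1.929), (120, 3.826)])
ratio = mini_slope / mbp_slope

print(f"Mini: {mini_slope:.3f} ms/tok per 1K tokens")
print(f"MBP:  {mbp_slope:.4f} ms/tok per 1K tokens")
print(f"Ratio: {ratio:.0f}x")
```

With unrounded endpoint slopes the ratio lands in the low-to-mid 30s, the same order as the rough figure quoted above.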

This matters for cascade design. A wall means a hard ceiling — cross it and performance collapses. A ramp means graceful degradation — you can exceed the ceiling and still get useful work done, just slowly.


What This Means for the Cascade

The Router/Reducer cascade runs with a 22K token bundle ceiling set in ADR-002, derived from the Mini’s cliff onset. That ceiling was set conservatively to stay below the measured cliff zone.

The revised picture:

Machine | Cliff onset | Safe operating ceiling | ADR-002 status
Mac mini M4 Pro | ~17–20K tokens | < 18K tokens | Ceiling stands; tighten from 22K
MacBook Pro M5 Max | ~43–50K tokens | < 40K tokens | Ceiling can be raised significantly

The cascade’s 22K ceiling is safe on the MBP — it sits well below the cliff. On the Mini, 22K is above the actual cliff onset; that ceiling should be tightened to 18K. A bundle that would trigger a 20-minute prefill on the Mini completes in under 2 minutes on the MBP.
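
One way to act on per-machine ceilings is a small guard in front of the cascade. The sketch below is hypothetical: the machine keys, the constant name, and the function are invented for illustration and are not taken from the Router/Reducer code. Only the two ceiling values come from the table above.

```python
# Safe operating ceilings from the revised table, in tokens.
# The dictionary keys are illustrative labels, not real configuration names.
SAFE_CEILING_TOKENS = {
    "mac-mini-m4-pro": 18_000,
    "macbook-pro-m5-max": 40_000,
}


def bundle_fits(machine: str, bundle_tokens: int) -> bool:
    """Return True if a bundle stays strictly below the machine's safe ceiling."""
    return bundle_tokens < SAFE_CEILING_TOKENS[machine]


# A bundle at the current ADR-002 ceiling of 22K tokens:
print(bundle_fits("mac-mini-m4-pro", 22_000))
print(bundle_fits("macbook-pro-m5-max", 22_000))
```

Keying the ceiling on hardware rather than hard-coding one number keeps the ADR value from silently becoming a property of a single machine again.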

The Nota Simple vs Catastro mismatch query (Q3 in Phase D) — the CasaSol stress test designed to hit the Mini’s ceiling — should complete on the MBP where it times out on the Mini. That remains to be measured in Phase D.


H1 and H2: Confirmed

H1 is confirmed well above the pre-registered threshold. The MBP cliff threshold is 2.5× higher than the Mini’s, and gen t/s is roughly +165% to +438% across tested context sizes.

H2 is confirmed. Gen t/s at 15K: +201%. At 25K: +361%. Both far above the pre-registered 10% threshold.
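
The H2 percentages are a relative gain over the Mini baseline, computed from the Phase A table. A minimal sketch of that arithmetic follows; the function and variable names are illustrative.

```python
H2_THRESHOLD_PCT = 10.0  # pre-registered minimum improvement


def pct_gain(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


# (Mini gen t/s, MBP gen t/s) at medium context, from the Phase A table.
PHASE_A = {15: (25.08, 75.52), 25: (14.40, 66.44)}

gains = {size: pct_gain(mini, mbp) for size, (mini, mbp) in PHASE_A.items()}
for size, gain in gains.items():
    print(f"{size}K: {gain:+.0f}% (meets H2: {gain >= H2_THRESHOLD_PCT})")
```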

H3 (thermal decay) and H4 (cascade portability) are pending Phase C and Phase D, respectively. The fan was audible and the chassis warm during the 110K–120K prefill cells in Phase B — anecdotal evidence that Phase C will surface something real.


The Verdict

The operating envelopes in Chronos were not constants. They were properties of one machine.

The Mac mini M4 Pro remains the production anchor for sustained Nestor sessions — whisper-quiet, always-on, desktop-class thermals. But for workloads that push context size, the MacBook Pro M5 Max is in a different class. Not incrementally better. Architecturally different.

The right question for local AI infrastructure is no longer just “what model?” It is “what model, on what hardware, at what context size?” The answers are not interchangeable.

Evidence: