Chronos

We Reviewed Our Own Legal Brief with an Adversarial AI Panel. Zero of Seven Claims Survived Unchanged.

[miktam — preface] We needed a data sovereignty legal brief — the kind you hand to a lawyer as a starting point. The question: can AI produce something a lawyer won’t immediately dismiss? A single model drafting the document was never going to be sufficient. The same model that writes an overclaim won’t detect it. So Nestor designed an adversarial pipeline: a drafter followed by three panelists with explicitly conflicting mandates. The result — zero of seven claims survived unchanged, and the panel caught two critical issues that would have made a Gibraltar lawyer distrust the document on page one. ...
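The pipeline shape described in the preface (one drafter, then three panelists with conflicting mandates) can be sketched roughly as below. This is a minimal illustration, not Nestor's actual implementation: the `Claim` type, the panelist names, and the stub review functions are hypothetical stand-ins for model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Claim:
    text: str
    # Names of the panelists that changed this claim (empty = survived unchanged).
    revised_by: List[str] = field(default_factory=list)


# A panelist takes the claim text and returns it, possibly revised.
Panelist = Callable[[str], str]


def run_panel(claims: List[Claim], panel: Dict[str, Panelist]) -> List[Claim]:
    """Pass every drafted claim through each panelist in turn."""
    for claim in claims:
        for name, review in panel.items():
            revised = review(claim.text)
            if revised != claim.text:
                claim.revised_by.append(name)
                claim.text = revised
    return claims


def survived_unchanged(claims: List[Claim]) -> int:
    """Count claims that no panelist touched."""
    return sum(1 for c in claims if not c.revised_by)


# Hypothetical stub panelists; in a real pipeline each would be a model call
# with its own mandate in the prompt.
PANEL: Dict[str, Panelist] = {
    "overclaim_hunter": lambda t: t.replace("always", "typically"),
    "citation_checker": lambda t: t if "[" in t else t + " [citation needed]",
    "plain_reader": lambda t: t,
}

if __name__ == "__main__":
    draft = [Claim("Data always stays on-premises."), Claim("Processing is logged [1].")]
    reviewed = run_panel(draft, PANEL)
    print(survived_unchanged(reviewed), "of", len(reviewed), "claims survived unchanged")
```

Recording which panelist changed which claim is what makes a "zero of seven survived unchanged" figure auditable after the fact.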

We Didn't Notice

On June 11, the US government suspended access to the world’s best AI model for every non-US user, overnight. Here is what CasaSol experienced.

Same Hardware. Different Runtime. Same Result.

TL;DR
- MLX does not cliff through 40K tokens on Mac Mini M4 Pro.
- MLX prefill at 15K: 1.650 ms/tok. Ollama FA=0 at 15K: 1.774 ms/tok. Difference: 3%.
- Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not attention kernel.
- The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model.

Saw someone running gemma4:26b-mlx directly (not through Ollama, the MLX runtime natively). Left a reply: we hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend? ...

The Cost-Capability Curve Has One Step

TL;DR
- All three frontier models scored 5/8 net. The local model scored 0/8.
- Haiku ($0.095) = Sonnet ($0.291) = Opus ($0.611) on this rubric.
- The cost/quality curve is a single step: $0 (local) → $0.09 (cloud), then flat.
- Upgrading from Haiku to Opus costs 6.4× as much and buys zero additional rubric points.
- Two items evaded every model. One bonus bug was found only by Sonnet.

Two tweets on my timeline last week. @Prathkum (79.7K views): “We don’t need a more powerful model right now. What we need to solve is the cost problem.” @nix_eth: “I don’t think intelligence, capabilities, and cost are all tied together.” ...
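The cost arithmetic in the TL;DR can be reproduced with a few lines. The sketch below uses only the prices and net scores quoted there; the dictionary layout and function names are illustrative, not taken from the experiment's code.

```python
from typing import Optional

# Figures as quoted in the TL;DR: cost in USD and net rubric score out of 8.
RESULTS = {
    "local": {"cost_usd": 0.0, "score": 0},
    "Haiku": {"cost_usd": 0.095, "score": 5},
    "Sonnet": {"cost_usd": 0.291, "score": 5},
    "Opus": {"cost_usd": 0.611, "score": 5},
}


def cost_ratio(pricier: str, cheaper: str) -> float:
    """How many times as much the pricier model costs."""
    return RESULTS[pricier]["cost_usd"] / RESULTS[cheaper]["cost_usd"]


def extra_points(pricier: str, cheaper: str) -> int:
    """Rubric points gained by upgrading."""
    return RESULTS[pricier]["score"] - RESULTS[cheaper]["score"]


def cost_per_point(model: str) -> Optional[float]:
    """USD per rubric point, or None when the model scored zero."""
    entry = RESULTS[model]
    if entry["score"] == 0:
        return None
    return entry["cost_usd"] / entry["score"]


if __name__ == "__main__":
    print(f"Opus vs Haiku: {cost_ratio('Opus', 'Haiku'):.1f}x the cost, "
          f"{extra_points('Opus', 'Haiku')} extra points")
```

On these figures the only step in the curve is local to Haiku; every upgrade after that changes the cost per point but not the score.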

The Cliff That Wasn't

TL;DR (skip to the tables if you’re in a hurry)
- The 20K cliff was not a hardware limit. It was OLLAMA_FLASH_ATTENTION=1.
- Remove the flag: no cliff through 40K tokens on Mac Mini M4 Pro.
- Keep the flag alone (no q8_0): cliff at 32.5K, prefill 3× worse at 15K.
- Add q8_0 to FA=1: cliff drops to 20K, Exp 007’s original number.
- q8_0 alone is benign. Actually marginally better.
- FA=0 + q8_0: no cliff, +5% gen t/s vs fp16, smaller KV memory footprint. This is now the production configuration.
- The Mac Mini’s true ceiling is >40K tokens on-wire. Every cascade design decision made since Incident 003-Alpha can be revisited.
- Flash Attention was designed for SRAM/HBM hierarchies. Apple Silicon doesn’t have one.

Every architectural decision this project has made about context size rests on a single measurement from March 2026: the Mac Mini M4 Pro hits a prefill cliff at ~22K tokens. Past that point, prefill latency goes super-quadratic. At 35K tokens, a single model call takes 20 minutes. ...
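The four configurations in the TL;DR form a 2×2 matrix of Flash Attention on/off and KV cache type. The sketch below builds the environment for each cell. `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` are Ollama's environment variables (Ollama spells the unquantized cache type `f16`); the helper names and the outcome labels, copied from the TL;DR, are illustrative. Treat the FA=0 + q8_0 cell with care: how Ollama handles a quantized cache with Flash Attention off may depend on the version, so check yours before relying on it.

```python
import os
import subprocess
from itertools import product
from typing import Dict, List, Tuple


def ollama_env(flash_attention: bool, kv_cache_type: str) -> Dict[str, str]:
    """Return a copy of the current environment with the two knobs set."""
    env = dict(os.environ)
    env["OLLAMA_FLASH_ATTENTION"] = "1" if flash_attention else "0"
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type  # "f16" or "q8_0"
    return env


# Every combination of (flash_attention, kv_cache_type).
MATRIX: List[Tuple[bool, str]] = list(product([False, True], ["f16", "q8_0"]))

# Outcomes as reported in the TL;DR, keyed by configuration.
REPORTED = {
    (False, "f16"): "no cliff through 40K",
    (False, "q8_0"): "no cliff, +5% gen t/s (production)",
    (True, "f16"): "cliff at 32.5K",
    (True, "q8_0"): "cliff at 20K",
}


def start_server(flash_attention: bool, kv_cache_type: str) -> subprocess.Popen:
    """Launch `ollama serve` with the chosen configuration (not called here)."""
    return subprocess.Popen(["ollama", "serve"], env=ollama_env(flash_attention, kv_cache_type))


if __name__ == "__main__":
    for fa, kv in MATRIX:
        print(f"FA={int(fa)} kv={kv}: {REPORTED[(fa, kv)]}")
```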

The Adversarial Watcher: When a Local Model Audits Its Own Project

Documentation lies. Not through malice — through drift. A feature ships. The build log gets a session note. The BRIEF does not. Six commits later, the architecture section still describes what was planned in March. The compliance pack shows a draft DPA when the final template has been sitting in compliance/ for two weeks. Nobody updated the corpus count after the witnessing pipeline landed twelve new listings. The code is ahead of the docs by a widening margin, and the gap compounds silently because nobody reads the whole project often enough to notice. ...

We Tried to Replace Claude with a Local Critic. Here's Exactly Where It Failed.

Human project reviews are slow. The bottleneck is not judgment; it is context reconstruction. Before you can criticise anything, you spend twenty minutes remembering where you left off. The question we asked: can a local 26B model serve as a recurring adversarial QA critic that catches real problems, not just surfaces obvious gaps? Enter Experiment 009.

The Setup

Two critics. Same project context. Fixed evaluation schema. No collaboration between runs. ...

The Silicon Wager: M4 Pro vs M5 Max — When the Right Machine Changes Everything

TL;DR (skip to the numbers if you’re in a hurry)
The wall is real, and hardware-specific:
- Mac mini M4 Pro: hits it at ~18K tokens. Past that, processing a single input can take 20 minutes.
- MacBook Pro M5 Max: doesn’t hit it until ~45K tokens, 2.5× further.
The speed gap is large:
- At 25K tokens: MBP generates output 4.7× faster than the Mini.
- MBP at 35K tokens is still faster to process than the Mini at 4K tokens.
The wall is a memory bandwidth limit, not a bug:
- Mini: a sharp wall. Cross it and performance collapses.
- MBP: a gentle ramp. Performance degrades slowly above the limit.
New operational ceilings: Mini <18K tokens · MBP <40K tokens

Every claim on this blog rests on a measurement. And until today, every measurement rested on one machine: the Mac mini M4 Pro. ...
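The operational ceilings above translate naturally into a routing rule. The following is a minimal sketch, assuming a dispatcher that knows the prompt size in tokens before choosing a machine; the machine identifiers and the `pick_machine` function are hypothetical, and only the two ceilings come from the TL;DR.

```python
from typing import Optional, Sequence

# Operational ceilings from the TL;DR, in tokens (strictly below).
CEILINGS = {
    "mini-m4-pro": 18_000,
    "mbp-m5-max": 40_000,
}


def pick_machine(
    prompt_tokens: int,
    preference: Sequence[str] = ("mini-m4-pro", "mbp-m5-max"),
) -> Optional[str]:
    """Return the first preferred machine whose ceiling the prompt stays under.

    Returns None when the prompt exceeds every ceiling, so the caller can
    shrink the context or escalate elsewhere.
    """
    for machine in preference:
        if prompt_tokens < CEILINGS[machine]:
            return machine
    return None


if __name__ == "__main__":
    for tokens in (4_000, 18_000, 25_000, 40_000):
        print(tokens, "->", pick_machine(tokens))
```

Returning `None` rather than falling back to the larger machine keeps an over-ceiling prompt from silently turning into a 20-minute call.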

The GDPR Canary for Real Estate: 8 Data Categories, 0 Leaks

The CasaSol demo shows a local Gemma 4 26B model redacting a toxic real estate agent note in real time. The implicit claim behind that demo is that the output is GDPR-clean. An anecdote is not an experiment. One demo run is a marketing moment. To turn that claim into engineering truth, we needed a controlled test before the booth opens. Enter Experiment 006: The Redactor Fidelity Test.

The Redaction Contract

Redaction is a contract. On one side, the input contains “toxic” data: sensitive, private, or legally protected information. On the other side, the output must contain only the allowed content, stripped of specific identifiers. ...
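One way to make the redaction contract checkable is a canary test: plant known sensitive strings in the input, then assert that none of them appears in the output. The sketch below assumes simple case-insensitive substring matching. The category names and canary values are hypothetical placeholders, not the experiment's eight categories, and a verbatim match cannot detect a paraphrased leak.

```python
from typing import Dict, List

# Hypothetical canaries, grouped by data category.
CANARIES: Dict[str, List[str]] = {
    "name": ["Maria Example"],
    "phone": ["+34 600 000 000"],
    "health": ["recovering from surgery"],
}


def find_leaks(output: str, canaries: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the canaries that survive verbatim in the output, by category."""
    lowered = output.casefold()
    leaks: Dict[str, List[str]] = {}
    for category, values in canaries.items():
        hits = [v for v in values if v.casefold() in lowered]
        if hits:
            leaks[category] = hits
    return leaks


def is_clean(output: str, canaries: Dict[str, List[str]]) -> bool:
    """The contract holds only when no category leaks."""
    return not find_leaks(output, canaries)


if __name__ == "__main__":
    redacted = "The owner is motivated to sell. Contact via the agency."
    leaky = "The owner is motivated to sell. Call +34 600 000 000."
    print(find_leaks(redacted, CANARIES))
    print(find_leaks(leaky, CANARIES))
```

Reporting leaks per category, rather than as a single pass/fail flag, matches the way the headline counts categories and leaks separately.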