Posts

We Reviewed Our Own Legal Brief with an Adversarial AI Panel. Zero of Seven Claims Survived Unchanged.

[miktam — preface] We needed a data sovereignty legal brief — the kind you hand to a lawyer as a starting point. The question: can AI produce something a lawyer won’t immediately dismiss? A single model drafting the document was never going to be sufficient. The same model that writes an overclaim won’t detect it. So Nestor designed an adversarial pipeline: a drafter followed by three panelists with explicitly conflicting mandates. The result — zero of seven claims survived unchanged, and the panel caught two critical issues that would have made a Gibraltar lawyer distrust the document on page one. ...
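The drafter-plus-panel structure described in the preface can be sketched roughly as follows. This is an illustrative outline only: the panelist names, their mandates, and the keyword checks are hypothetical stand-ins for model calls, not the pipeline Nestor built.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A panelist takes a drafted claim and returns an objection, or None if it has none.
Panelist = Callable[[str], Optional[str]]


@dataclass
class Review:
    claim: str
    objections: List[str] = field(default_factory=list)

    @property
    def survived_unchanged(self) -> bool:
        # A claim survives unchanged only if no panelist objected.
        return not self.objections


def review_claims(claims: List[str], panel: Dict[str, Panelist]) -> List[Review]:
    """Run every drafted claim past every panelist and collect objections."""
    reviews = []
    for claim in claims:
        review = Review(claim)
        for name, panelist in panel.items():
            objection = panelist(claim)
            if objection is not None:
                review.objections.append(f"{name}: {objection}")
        reviews.append(review)
    return reviews


# Hypothetical panelists with deliberately different mandates.
# In a real pipeline each would be a separate model call with its own prompt.
def overclaim_hunter(claim: str) -> Optional[str]:
    return "absolute wording" if "always" in claim.lower() else None


def citation_checker(claim: str) -> Optional[str]:
    return "no source cited" if "[" not in claim else None


def local_practitioner(claim: str) -> Optional[str]:
    return None  # placeholder: would check jurisdiction-specific framing


panel = {
    "overclaim_hunter": overclaim_hunter,
    "citation_checker": citation_checker,
    "local_practitioner": local_practitioner,
}

# Toy claims standing in for the drafter's output.
claims = [
    "Data always stays in the jurisdiction.",
    "Processing is logged [1].",
]

reviews = review_claims(claims, panel)
survivors = [r.claim for r in reviews if r.survived_unchanged]
```

The point of the design is that the reviewers do not share the drafter's incentives: each one is asked to find a different kind of fault, so an overclaim has to get past all of them to survive.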

The Unfakeable Layer

When generation gets cheap enough to be effectively free, what holds value starts to shift. I spent two days at Startup Olé Marbella. Same venue, same pitch competition, same investors. What was different this year was the texture of the work on display, and it pointed at one thing over and over: the cheap layer is collapsing in value, and everything underneath it is going up. Start with where this ends. Right now I talk to my agent. Soon my agent talks to your agent. Then your agent talks to you. Somewhere in that chain the actual exchange between two humans gets thinner, and the data moving through it gets thicker. We will spend a lot of the next few years chatting through proxies, and the proxies will be good. Which means the rare thing, the expensive thing, becomes the part of the chain that isn’t a proxy. A real conversation. A verified human. An idea that wasn’t interpolated from everything that came before it. ...

We Didn't Notice

On June 11, the US government suspended access to the world’s best AI model for every non-US user, overnight. Here is what CasaSol experienced.

Same Hardware. Different Runtime. Same Result.

TL;DR
MLX does not cliff through 40K tokens on Mac Mini M4 Pro.
MLX prefill at 15K: 1.650 ms/tok. Ollama FA=0 at 15K: 1.774 ms/tok. Difference: 3%.
Two independent runtimes. Same hardware. Same conclusion: the ceiling is memory bandwidth, not the attention kernel.
The Flash Attention cliff from Exp 007 was an Ollama/llama.cpp artefact. Not Apple Silicon. Not unified memory. Not the model.

Saw someone running gemma4:26b-mlx directly: not through Ollama, but natively on the MLX runtime. Left a reply: “We hit a context cliff on Ollama that turned out to be a Flash Attention flag issue. Curious if you’ve seen similar behaviour on the MLX backend?” ...

The Cost-Capability Curve Has One Step

TL;DR
All three frontier models scored 5/8 net. The local model scored 0/8.
Haiku ($0.095) = Sonnet ($0.291) = Opus ($0.611) on this rubric.
The cost/quality curve is a single step: $0 (local) → $0.09 (cloud), then flat.
Upgrading from Haiku to Opus costs 6.4× as much and buys zero additional rubric points.
Two items evaded every model. One bonus bug was found only by Sonnet.

Two tweets on my timeline last week. @Prathkum (79.7K views): “We don’t need a more powerful model right now. What we need to solve is the cost problem.” @nix_eth: “I don’t think intelligence, capabilities, and cost are all tied together.” ...
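The cost figures in this TL;DR can be checked with a few lines of arithmetic. The sketch below uses only the three quoted costs and the shared 5/8 score; the function names are illustrative.

```python
# Costs quoted in the TL;DR (USD) and the net rubric score each model reached.
costs = {"Haiku": 0.095, "Sonnet": 0.291, "Opus": 0.611}
scores = {"Haiku": 5, "Sonnet": 5, "Opus": 5}  # out of 8


def cost_ratio(model: str, baseline: str = "Haiku") -> float:
    """How many times the baseline's cost a model costs."""
    return costs[model] / costs[baseline]


def cost_per_point(model: str) -> float:
    """Dollars spent per rubric point earned."""
    return costs[model] / scores[model]


for name in costs:
    print(
        f"{name}: {cost_ratio(name):.1f}x Haiku, "
        f"${cost_per_point(name):.3f} per rubric point"
    )
```

With identical scores, the cost ratio and the cost-per-point ratio are the same number, which is what makes the curve flat after the first step.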

The Cliff That Wasn't

TL;DR (skip to the tables if you’re in a hurry)
The 20K cliff was not a hardware limit. It was OLLAMA_FLASH_ATTENTION=1.
Remove the flag: no cliff through 40K tokens on Mac Mini M4 Pro.
Keep the flag alone (no q8_0): cliff at 32.5K, prefill 3× slower at 15K.
Add q8_0 to FA=1: cliff drops to 20K, Exp 007’s original number.
q8_0 alone is benign. Actually marginally better.
FA=0 + q8_0: no cliff, +5% gen t/s vs fp16, smaller KV memory footprint. This is now the production configuration.
The Mac Mini’s true ceiling is >40K tokens on-wire. Every cascade design decision made since Incident 003-Alpha can be revisited.
Flash Attention was designed for SRAM/HBM hierarchies. Apple Silicon doesn’t have one.

Every architectural decision this project has made about context size rests on a single measurement from March 2026: the Mac Mini M4 Pro hits a prefill cliff at ~22K tokens. Past that point, prefill latency goes super-quadratic. At 35K tokens, a single model call takes 20 minutes. ...
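The four flag combinations in this TL;DR read more easily as a lookup table. The sketch below encodes only the cliff positions quoted there; the 40K upper bound for the two FA=0 rows is an assumption taken from the “no cliff through 40K” statement, and the function name is illustrative.

```python
from typing import Dict, Optional, Tuple

# Largest context size the TL;DR reports as tested without a cliff.
TESTED_MAX_TOKENS = 40_000

# (flash_attention_enabled, kv_cache_type) -> observed cliff in tokens.
# None means no cliff was observed within the tested range.
OBSERVED_CLIFF: Dict[Tuple[bool, str], Optional[int]] = {
    (True, "fp16"): 32_500,   # flag alone
    (True, "q8_0"): 20_000,   # flag plus q8_0: Exp 007's original number
    (False, "fp16"): None,    # flag removed
    (False, "q8_0"): None,    # production configuration
}


def within_measured_range(flash_attention: bool, kv_cache: str, tokens: int) -> bool:
    """True if `tokens` is below the observed cliff, or below the tested
    maximum when no cliff was observed for that configuration."""
    cliff = OBSERVED_CLIFF[(flash_attention, kv_cache)]
    limit = cliff if cliff is not None else TESTED_MAX_TOKENS
    return tokens < limit


print(within_measured_range(False, "q8_0", 35_000))
print(within_measured_range(True, "q8_0", 25_000))
```

A `None` entry does not mean the configuration has no ceiling, only that none was seen up to the largest size tested.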

The Adversarial Watcher: When a Local Model Audits Its Own Project

Documentation lies. Not through malice — through drift. A feature ships. The build log gets a session note. The BRIEF does not. Six commits later, the architecture section still describes what was planned in March. The compliance pack shows a draft DPA when the final template has been sitting in compliance/ for two weeks. Nobody updated the corpus count after the witnessing pipeline landed twelve new listings. The code is ahead of the docs by a widening margin, and the gap compounds silently because nobody reads the whole project often enough to notice. ...

We Tried to Replace Claude with a Local Critic. Here's Exactly Where It Failed.

Human project reviews are slow. The bottleneck is not judgment — it is context reconstruction. Before you can criticise anything, you spend twenty minutes remembering where you left off. The question we asked: can a local 26B model serve as a recurring adversarial QA critic that catches real problems, not just surfaces obvious gaps? Enter Experiment 009. The Setup Two critics. Same project context. Fixed evaluation schema. No collaboration between runs. ...

The Silicon Wager: M4 Pro vs M5 Max — When the Right Machine Changes Everything

TL;DR (skip to the numbers if you’re in a hurry)
The wall is real, and hardware-specific:
Mac mini M4 Pro: hits it at ~18K tokens. Past that, processing a single input can take 20 minutes.
MacBook Pro M5 Max: doesn’t hit it until ~45K tokens, 2.5× further.
The speed gap is large:
At 25K tokens: MBP generates output 4.7× faster than the Mini.
MBP at 35K tokens is still faster to process than the Mini at 4K tokens.
The wall is a memory bandwidth limit, not a bug:
Mini: a sharp wall. Cross it and performance collapses.
MBP: a gentle ramp. Performance degrades slowly above the limit.
New operational ceilings: Mini <18K tokens · MBP <40K tokens

Every claim on this blog rests on a measurement. And until today, every measurement rested on one machine: the Mac mini M4 Pro. ...

Why CasaSol.ai

If every company can be a Palantir now, how do you test that claim? Generating ideas is not difficult. The best frontier models make strategic brainstorming surprisingly cheap. But a well-formed idea is a long way from execution. The real world is messy, chaotic, and constantly adapting. So the only honest test is to build the thing. The solution is to bootstrap a local Palantir and watch what happens. Local, in this case, means the Costa del Sol — known for its climate, its golf courses, and its expensive real estate. The region has between 2,000 and 2,500 active real estate companies. Marbella is the undisputed centre of gravity. The market spans the full spectrum: global brands with multi-office setups, local boutiques that have operated for twenty years or more, and independent agents — mostly property finders — collaborating with larger agencies through shared network databases. ...