Documentation lies. Not through malice — through drift.

A feature ships. The build log gets a session note. The BRIEF does not. Six commits later, the architecture section still describes what was planned in March. The compliance pack shows a draft DPA when the final template has been sitting in compliance/ for two weeks. Nobody updated the corpus count after the witnessing pipeline landed twelve new listings. The code is ahead of the docs by a widening margin, and the gap compounds silently because nobody reads the whole project often enough to notice.

The traditional fix is a human review. Reviews are slow, expensive, and biased toward what the reviewer already knows to look for. The question worth asking is whether a local model can do this work instead — not as a one-off, but on a recurring cadence, before every merge.

The answer is yes, but the design of the pipeline matters more than the model.


The naive approach is to dump everything into one prompt (BRIEF, MOAT, TODO, the last twenty git commits) and ask “what’s wrong?” This works with frontier models. With a local 26B model it degrades above about 8K input tokens: the model starts losing early document content, confabulates connections that do not exist, and misses the subtle mismatches that are the interesting findings. The lost-in-the-middle problem is well documented. It shows up at lower thresholds with open-weight models than with frontier APIs, and it is more severe when the task requires recall across multiple document types rather than a single coherent text.

The fix is to break the work into stages. Each stage produces a small, structured intermediate that the next stage reads. No single LLM call sees more than 4K input tokens. The intermediates are saved to disk after every step, which means the pipeline is resumable and every output is an auditable artefact.
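A minimal sketch of what "resumable" can look like in practice, not the watcher's actual code. The names `run_stage` and `estimate_tokens`, the one-file-per-stage layout, and the four-characters-per-token estimate are all assumptions made for illustration; only the 4K ceiling comes from the text above.

```python
import json
from pathlib import Path
from typing import Callable

MAX_INPUT_TOKENS = 4_000  # per-call ceiling stated in the text


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return len(text) // 4 + 1


def run_stage(name: str, stage_input: str,
              fn: Callable[[str], object], out_dir: Path) -> object:
    """Run one stage, or reuse its saved output if it already exists."""
    out_path = out_dir / f"{name}.json"
    if out_path.exists():
        # Resume: an earlier run already produced this stage's output.
        return json.loads(out_path.read_text(encoding="utf-8"))

    tokens = estimate_tokens(stage_input)
    if tokens > MAX_INPUT_TOKENS:
        raise ValueError(
            f"stage {name!r}: ~{tokens} input tokens exceeds {MAX_INPUT_TOKENS}"
        )

    result = fn(stage_input)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Saving after every step is what makes each output an auditable file.
    out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

Failing loudly on an oversized input, rather than truncating it, keeps the budget a design constraint instead of a silent source of lost context.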

Five stages. The first extracts a flat JSON list of commitments from the intent documents — BRIEF, MOAT, TODO. Each commitment gets a stated status: done, in-progress, pending, unknown. The second extracts a flat JSON list of shipped artefacts from the build log and the last sixty git commits. The third stage is deliberately deterministic — no model involved, just a structured comparison of the two lists, producing underevidenced commitments and underdocumented artefacts. The fourth stage applies three adversarial personas — investor, DPO, engineer — each in a separate call, each receiving only the gap list and a focused system prompt. The fifth synthesises the persona findings into a ranked report.

The total token cost for a medium-sized project is around 15,000 tokens across six calls. The wall time on a Mac Mini M4 Pro running gemma4:26b is approximately 270 seconds.
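The five stages and six model calls can be sketched as a call pattern. This is an illustration under assumptions, not the watcher's implementation: `call_model(system_prompt, user_input)` and `diff` are hypothetical callables supplied by the caller, and the prompt strings are placeholders. Only the persona names and the stage order come from the text.

```python
from typing import Callable

PERSONAS = ("investor", "DPO", "engineer")  # named in the text


def run_pipeline(intent_text: str, evidence_text: str,
                 call_model: Callable[[str, str], str],
                 diff: Callable[[str, str], str]) -> str:
    """Wire the five stages together; call_model and diff are hypothetical."""
    # Stages 1 and 2: one extraction call each.
    commitments = call_model("Extract commitments as a flat JSON list.", intent_text)
    artefacts = call_model("Extract shipped artefacts as a flat JSON list.", evidence_text)

    # Stage 3: deterministic comparison, no model call.
    gaps = diff(commitments, artefacts)

    # Stage 4: one call per persona, each seeing only the gap list.
    findings = [
        call_model(f"You are a sceptical {persona}. Critique these gaps.", gaps)
        for persona in PERSONAS
    ]

    # Stage 5: one synthesis call over the persona findings.
    return call_model("Rank these findings into a report.", "\n\n".join(findings))
```

Two extraction calls, three persona calls, and one synthesis call account for the six calls; the third stage contributes none.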


The first production run against CasaSol produced four underevidenced commitments and seven underdocumented artefacts. Three of the four underevidenced findings were genuine.

The Reducer model had been swapped from qwen3.5:35b to gemma4:26b in May. The build log had a session note describing the change. No Architecture Decision Record existed anywhere in the repository. The watcher identified the mismatch because the evidence had a commit with “swap Reducer model” in the message and no matching commitment in BRIEF or MOAT. This is exactly the class of drift that human reviewers miss because they remember making the decision and assume it is documented.

The BRIEF status block was describing a project that no longer existed. It said 58 listings; the actual count was 70. It did not mention the witnessing pipeline, the DPA template, or the four-language public website, all of which had shipped in the preceding weeks. The watcher flagged each as an underdocumented artefact — a thing that appeared in commits without a corresponding commitment in the intent documents.

The corpus count claim was unverifiable from git alone. “58 enriched listings” appeared in BRIEF.md as a stated fact. It was accurate when written but had no evidence trail — no commit, no load script output, nothing a reader could check. The fix was to cite the load_db.py output directly in the BRIEF: “70 loaded, 0 errors, 2026-06-06.” A small change that makes the claim auditable.

Three of the flagged items were false positives, and their anatomy is as interesting as the real findings.

The RoPA was flagged as absent because the watcher did not ingest the compliance/ directory. compliance/01-ropa.md exists and is complete. The pipeline configuration was too narrow — it read BRIEF, MOAT, and TODO but not the compliance pack index. Adding compliance/README.md to the intent sources list fixes this for the next run. The false positive revealed a gap in the watcher, not in the project.
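As a configuration fragment, the change might look like the list below. The variable name `INTENT_SOURCES` and the file names `MOAT.md` and `TODO.md` are assumptions; only `BRIEF.md` and `compliance/README.md` are file names that appear in the text.

```python
# Hypothetical intent-source configuration for the next run.
INTENT_SOURCES = [
    "BRIEF.md",
    "MOAT.md",               # assumed file name
    "TODO.md",               # assumed file name
    "compliance/README.md",  # added so the compliance pack index is ingested
]
```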

Event registration was flagged as underevidenced: “Registration secured for Startup OLÉ Marbella” appears in BRIEF with no commit to prove it. This is a category error. Not every commitment produces a commit. The watcher does not yet know the difference between code artefacts and external actions. This is a real limitation worth understanding rather than fixing immediately.
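One way to picture the missing distinction, offered as a sketch and not as a planned fix: tag each commitment with the kind of evidence it could plausibly have. The `evidence_kind` field, its two values, and the rule that only commitments marked done are checked against git are all hypothetical; the status values are the ones listed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    text: str
    status: str                    # done, in-progress, pending, unknown
    evidence_kind: str = "commit"  # hypothetical: "commit" or "external"


def needs_commit_evidence(c: Commitment) -> bool:
    """Assumed rule: only finished, code-backed commitments are checked against git."""
    return c.status == "done" and c.evidence_kind == "commit"
```

Under this scheme an event registration would be tagged `external` and would never be compared against the commit history, though someone still has to decide how external actions get evidenced at all.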

The Adversarial Watcher itself was flagged as undocumented. It was created in the same session as the run that critiqued it. Temporal artefact. Dismissed.


The pattern has several named ancestors. LLM-as-judge uses a model to evaluate another model’s output. Reflexion has an agent critique its own trajectories. Constitutional AI applies a set of principles to model outputs and refines them. The dark factory framing from Microsoft Build 2026 — “live factory is people reviewing code, dark factory is agents reviewing it” — is the closest description of what this pipeline does at a conceptual level. ASSERT, the eval framework presented at the same conference, is structurally analogous: Systematize, Generate test sets, Run inference, Score against policy. The watcher’s five stages map directly, applied to project artefacts rather than model outputs.

The property this pipeline has that the literature mostly treats separately is the combination of staged intermediates with a deterministic gap synthesis step. The deterministic step is not an optimisation; it is a requirement. A language model doing set comparison over two lists of twenty items each is neither reliable nor cheap. A structured diff is both. Keeping the non-reasoning work outside the model is what makes the recall-heavy steps tractable at 4K tokens each.
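A minimal sketch of a deterministic gap step, to make the contrast concrete. The matching rule here (at least two shared keywords between a commitment and an artefact) is an assumption chosen for illustration, as are the `text` and `status` field names and the function names; the watcher's real comparison may work differently.

```python
import re

STOPWORDS = {"the", "and", "for", "with", "from", "into", "that", "this"}


def keywords(text: str) -> set[str]:
    """Lower-case ASCII word tokens longer than two characters, minus stopwords."""
    return {
        w for w in re.findall(r"[a-z0-9]+", text.lower())
        if len(w) > 2 and w not in STOPWORDS
    }


def find_gaps(commitments: list[dict], artefacts: list[dict], min_shared: int = 2):
    """Return (underevidenced commitments, underdocumented artefacts)."""
    matched_commitments, matched_artefacts = set(), set()
    for i, c in enumerate(commitments):
        for j, a in enumerate(artefacts):
            if len(keywords(c["text"]) & keywords(a["text"])) >= min_shared:
                matched_commitments.add(i)
                matched_artefacts.add(j)

    # A commitment claimed as done with no matching artefact lacks evidence.
    underevidenced = [
        c for i, c in enumerate(commitments)
        if i not in matched_commitments and c.get("status") == "done"
    ]
    # An artefact with no matching commitment shipped without documentation.
    underdocumented = [
        a for j, a in enumerate(artefacts) if j not in matched_artefacts
    ]
    return underevidenced, underdocumented
```

The same inputs always give the same gap list, so a surprising finding can be traced to a specific pair of items and a specific threshold.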


This post is a baseline. The pattern gets more interesting with more data points.

What I am looking to collect: real examples of staged adversarial pipelines in production, not single-shot critics; false-positive taxonomies across different project types; evidence about which personas reliably find which classes of problem; and trigger design. The watcher is currently invoked manually before merges, but the right automatic trigger is an open question. A failed test, a commit count threshold, the time elapsed since the last run: each produces a different false-positive profile.

The watcher is project-agnostic. The evidence from Run 001 (manifest, intermediate JSONs, annotated notes) is at github.com/miktam/local-first-ai/tree/main/tasks/chronos/watcher_run_001. If you have built something in this space, or tried and failed, the finding is worth recording either way.