We Tried to Replace Claude with a Local Critic. Here's Exactly Where It Failed.

Human project reviews are slow. The bottleneck is not judgment; it is context reconstruction. Before you can criticise anything, you spend twenty minutes remembering where you left off. The question we asked: can a local 26B model serve as a recurring adversarial QA critic that catches real problems rather than just surfacing obvious gaps?

Enter Experiment 009.

The Setup

Two critics. Same project context. Fixed evaluation schema. No collaboration between runs.

Context bundle (collected fresh for each run):

git log --oneline -20 — recent commit history
Last 100 lines of BUILD_LOG.md
First 80 lines of BRIEF.md
First 80 lines of strategy/TODO.md
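A minimal sketch of how a bundle like this could be assembled. This is not the actual critic.py; the helper names (head, tail, collect_context) and the section titles are our own assumptions, and the repo layout is taken from the list above.

```python
import subprocess
from pathlib import Path


def head(text: str, n: int) -> str:
    """Return the first n lines of text."""
    return "\n".join(text.splitlines()[:n])


def tail(text: str, n: int) -> str:
    """Return the last n lines of text (empty string if n <= 0)."""
    lines = text.splitlines()
    return "\n".join(lines[-n:]) if n > 0 else ""


def read_or_empty(path: Path) -> str:
    """Read a file as UTF-8, or return an empty string if it is absent."""
    return path.read_text(encoding="utf-8") if path.exists() else ""


def collect_context(repo: Path) -> str:
    """Build the context bundle fresh from the current state of the repo."""
    git_log = subprocess.run(
        ["git", "log", "--oneline", "-20"],
        cwd=repo, capture_output=True, text=True, check=False,
    ).stdout.strip()
    sections = {
        "git log --oneline -20": git_log,
        "BUILD_LOG.md (last 100 lines)": tail(read_or_empty(repo / "BUILD_LOG.md"), 100),
        "BRIEF.md (first 80 lines)": head(read_or_empty(repo / "BRIEF.md"), 80),
        "strategy/TODO.md (first 80 lines)": head(read_or_empty(repo / "strategy" / "TODO.md"), 80),
    }
    # Section titles are illustrative; any stable delimiter would do.
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections.items())
```

Collecting the bundle on every run, rather than caching it, keeps both critics looking at the same snapshot of the project.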

Three adversarial personas, applied in sequence by both critics:

Persona	Question
Sceptical investor	Is the moat real or just marketing copy?
DPO / AEPD auditor	Will the compliance architecture survive a real audit?
Competing engineer	How would I replicate this in a weekend?

Output: fixed JSON schema — holds_up, weak, missing per persona, plus top_actions and severity_counts. Same schema for both critics. Both outputs committed to Chronos before scoring.
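A fixed schema is only useful if each run is checked against it before scoring. The sketch below shows one way to do that. The field names holds_up, weak, missing, top_actions and severity_counts come from the schema above; the nesting under a "personas" key and the persona identifiers are assumptions made for illustration.

```python
import json

PERSONA_KEYS = ("holds_up", "weak", "missing")
# Hypothetical identifiers for the three personas; the real keys may differ.
PERSONAS = ("sceptical_investor", "dpo_aepd_auditor", "competing_engineer")


def validate_report(raw: str) -> list[str]:
    """Return a list of schema problems; an empty list means the report is usable."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["top level must be an object"]

    errors = []
    personas = report.get("personas")
    if not isinstance(personas, dict):
        errors.append("personas must be an object")
        personas = {}
    for name in PERSONAS:
        block = personas.get(name)
        if not isinstance(block, dict):
            errors.append(f"missing persona: {name}")
            continue
        for key in PERSONA_KEYS:
            if not isinstance(block.get(key), list):
                errors.append(f"{name}.{key} must be a list")
    if not isinstance(report.get("top_actions"), list):
        errors.append("top_actions must be a list")
    if not isinstance(report.get("severity_counts"), dict):
        errors.append("severity_counts must be an object")
    return errors
```

Rejecting a malformed report up front means a critic that drifts from the schema fails loudly instead of quietly skewing the scores.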

The project under review: CasaSol — a local-first AI property intelligence platform for Costa del Sol real estate. Full context in the CasaSol post.

The Numbers

Metric	Claude Sonnet 4.6	gemma4:26b
Runtime	inline	84.7 s
Input tokens	~4,750 (est)	8,966 (exact)
Output tokens	~1,170 (est)	902 (exact)
Speed	—	28.7 tok/s
Issues raised	18	18

Token visibility asymmetry is itself a data point. Ollama returns exact counts from the API (prompt_eval_count, eval_count, eval_duration). Claude ran inline in this experiment, so its counts are estimated (context chars ÷ 4). For the local model, you know exactly what you spent. For the inline run, you approximate. For a recurring QA gate, this matters.
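The two kinds of number in the table can be derived as follows. This is a sketch: the function names are ours, the chars ÷ 4 rule is the estimate described above, and the response dict is assumed to be the final JSON object from an Ollama generate or chat call, where eval_duration is reported in nanoseconds.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: characters divided by four."""
    return len(text) // 4


def ollama_stats(response: dict) -> dict:
    """Extract exact token counts and generation speed from an Ollama response."""
    eval_count = response["eval_count"]
    eval_seconds = response["eval_duration"] / 1e9  # nanoseconds to seconds
    return {
        "input_tokens": response["prompt_eval_count"],
        "output_tokens": eval_count,
        "tokens_per_second": eval_count / eval_seconds if eval_seconds else 0.0,
    }
```

Note that the speed figure covers generation only; prompt evaluation and model load time are reported separately, which is why total runtime is longer than output tokens divided by tok/s.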

Where They Agreed: The Compliance Layer

Both critics identified the same three compliance gaps, almost word for word:

Issue	Claude	gemma4
No DPA template for the agency-to-CasaSol relationship	✓	✓ exact match
No DSAR (Data Subject Access Request) procedure	✓	✓ exact match
No DPIA for the VLM-based extraction process	✓ (via blur-not-implemented)	✓ explicit

On the legal and compliance surface, the 26B model running locally matched the frontier model in this run. The compliance gaps are findable by pattern-matching on what should be present but isn’t: document types that have known names and known requirements. The model doesn’t need to reason about the project; it needs to know what a compliant system looks like.

Where They Diverged: The Engineering Layer

Claude flagged the highest-severity engineering issue:

The VLM witnessing pipeline — described in the deck and the BRIEF as the primary moat component, the second pillar of competitive advantage alongside the filesystem firewall — does not exist in any commit. The corpus is built from text seed files.

gemma4 never found this. Instead, it reasoned about replicability: “The witnessing pipeline is essentially a manual OCR/VLM task that can be replicated with any high-end VLM API.” That is a valid concern. But it is a different question. gemma4 read the BUILD_LOG and BRIEF, accepted the claim that witnessing was implemented, and then critiqued it on competitive grounds.

Claude read the same BUILD_LOG claim (“witnessing reframe — image capture is core MVP”) and then cross-referenced it against the git log. No VLM commit exists. The claim is documented; the code is not. Claude flagged the gap between the two.

This is the distinction the experiment was designed to surface: pattern-matching on what is present versus detecting the gap between documented intent and implemented reality.
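To make the second kind of check concrete, here is a deliberately crude sketch of a docs-versus-commits cross-reference. It is not how either critic works internally. The function name, the claim labels and the keyword lists are all hypothetical, and plain substring matching on commit subjects will miss renamed or indirectly implemented features.

```python
def unimplemented_claims(
    claims: dict[str, list[str]], commit_subjects: list[str]
) -> list[str]:
    """Return documented claims with no keyword match in any commit subject.

    claims maps a human-readable claim to keywords that a commit
    implementing it would plausibly mention (an assumption, chosen by hand).
    """
    lowered = [subject.lower() for subject in commit_subjects]
    gaps = []
    for claim, keywords in claims.items():
        mentioned = any(
            keyword.lower() in subject
            for keyword in keywords
            for subject in lowered
        )
        if not mentioned:
            gaps.append(claim)
    return gaps
```

Fed the output of git log --oneline -20 and a hand-written claims map, a check like this would list any claim whose keywords never appear in a commit subject. The hard part, which the sketch skips, is extracting the claims from BUILD_LOG.md and BRIEF.md in the first place.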

The “False Positives” Are Not False

gemma4’s ~50% false-positive rate (issues it raised that Claude did not) sounds like noise. It isn’t. On review, most are valid gaps Claude simply didn’t explore:

Filesystem firewall has no cryptographic or network-level isolation — it is a deployment choice, not a technical barrier
Manual witnessing labor is hard to scale without headcount
The Redactor’s PII leakage rate has not been formally measured (this is what exp_007 will cover)
No hardware deployment cost model for a distributed agency network

gemma4 explored different territory. The high false-positive rate reflects two critics with different search strategies, not one critic being wrong.

The Verdict

Metric	Result	Threshold	Pass?
Issue overlap rate (high-severity)	~50%	≥60%	FAIL
False positive rate	~50%	≤30%	FAIL

Overall: FAIL as a drop-in replacement. Useful as a complementary signal.
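The two metrics can be computed from sets of issue labels, as sketched below. Several things here are assumptions: that issues from both critics have already been mapped by hand to shared labels, that overlap is measured as the share of Claude's high-severity issues gemma4 also raised, and the function and parameter names. The false-positive definition (issues gemma4 raised that Claude did not) and the 60% and 30% thresholds come from the experiment.

```python
def score(
    reference_high: set[str],
    reference_all: set[str],
    candidate_all: set[str],
    min_overlap: float = 0.60,
    max_false_positive: float = 0.30,
) -> dict:
    """Score a candidate critic against a reference critic."""
    # Share of the reference's high-severity issues the candidate also raised.
    overlap = (
        len(reference_high & candidate_all) / len(reference_high)
        if reference_high else 0.0
    )
    # Share of the candidate's issues that the reference did not raise.
    false_positive = (
        len(candidate_all - reference_all) / len(candidate_all)
        if candidate_all else 0.0
    )
    return {
        "overlap_rate": overlap,
        "false_positive_rate": false_positive,
        "pass": overlap >= min_overlap and false_positive <= max_false_positive,
    }
```

Because "false positive" here only means "not raised by the reference", a high value says the critics disagree, not which one is right. That still has to be settled by reading the issues.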

gemma4 catches compliance gaps cheaply and locally. A recurring run on every significant commit needs no API and no network, and takes about 85 seconds on an M4 Pro. Claude caught the impl-vs-docs gap that gemma4 missed. That finding, the one that actually changes what you demo at a booth in eleven days, is the one that required something beyond pattern matching.

The Open Question

The compliance layer is findable by knowing what documents a compliant system requires. The implementation layer requires knowing what was claimed in the past and checking whether it was built. Is that memory? Reasoning? Context length? Something else?

We don’t have a clean answer. What we have is a reproducible setup, two committed artefacts, and a scoring log. The next run can change one variable — context window, model size, prompt design — and measure the delta.

That is what Chronos is for.

Evidence artefacts:

HYPOTHESIS.md: pre-registered before execution
claude_20260606_134500.json: Option A (Claude) output
gemma4_26b_20260606T131205.json: Option B (gemma4:26b) output
EXECUTION_LOG.md: scoring, observations, verdict
critic.py: Option B script (run locally against any project)
