Human project reviews are slow. The bottleneck is not judgment; it is context reconstruction. Before you can criticise anything, you spend twenty minutes remembering where you left off. The question we asked: can a local 26B model serve as a recurring adversarial QA critic that catches real problems rather than just surfacing obvious gaps?
Enter Experiment 009.
The Setup
Two critics. Same project context. Fixed evaluation schema. No collaboration between runs.
Context bundle (collected fresh for each run):
- git log --oneline -20 (recent commit history)
- Last 100 lines of BUILD_LOG.md
- First 80 lines of BRIEF.md
- First 80 lines of strategy/TODO.md
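The bundle described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual critic.py: the function names (`head`, `tail`, `git_log`, `collect_bundle`) and the section labels are assumptions made for the example; only the four sources and their line limits come from the list above.

```python
import subprocess
from pathlib import Path


def head(path: Path, n: int) -> str:
    """Return the first n lines of a text file ('' if the file is missing)."""
    if not path.exists():
        return ""
    return "\n".join(path.read_text(encoding="utf-8").splitlines()[:n])


def tail(path: Path, n: int) -> str:
    """Return the last n lines of a text file ('' if the file is missing)."""
    if not path.exists() or n <= 0:
        return ""
    return "\n".join(path.read_text(encoding="utf-8").splitlines()[-n:])


def git_log(repo: Path, count: int = 20) -> str:
    """Run `git log --oneline -<count>` inside repo and return its stdout."""
    result = subprocess.run(
        ["git", "log", "--oneline", f"-{count}"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def collect_bundle(repo: Path) -> str:
    """Assemble the four context sources into one labelled string (labels are illustrative)."""
    sections = {
        "GIT LOG (last 20 commits)": git_log(repo, 20),
        "BUILD_LOG.md (last 100 lines)": tail(repo / "BUILD_LOG.md", 100),
        "BRIEF.md (first 80 lines)": head(repo / "BRIEF.md", 80),
        "strategy/TODO.md (first 80 lines)": head(repo / "strategy" / "TODO.md", 80),
    }
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections.items())
```

Calling `collect_bundle(Path("."))` from the project root would produce one string to hand to either critic. Collecting it fresh per run keeps both critics on identical input.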
Three adversarial personas, applied in sequence by both critics:
| Persona | Question |
|---|---|
| Sceptical investor | Is the moat real or just marketing copy? |
| DPO / AEPD auditor | Will the compliance architecture survive a real audit? |
| Competing engineer | How would I replicate this in a weekend? |
Output: fixed JSON schema — holds_up, weak, missing per persona, plus top_actions and severity_counts. Same schema for both critics. Both outputs committed to Chronos before scoring.
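A fixed schema is only useful if each output is checked against it before scoring. The sketch below shows one way to do that. The field names holds_up, weak, missing, top_actions and severity_counts come from the post; the persona keys, the nesting under a `personas` object, and the function names are assumptions made for illustration.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical persona keys: the post names the personas but not their JSON keys.
PERSONAS = ("sceptical_investor", "dpo_aepd_auditor", "competing_engineer")
PERSONA_FIELDS = ("holds_up", "weak", "missing")


def validate_report(report: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the report conforms."""
    problems = []
    personas = report.get("personas")  # assumed nesting, not confirmed by the post
    if not isinstance(personas, dict):
        problems.append("personas: expected an object")
        personas = {}
    for persona in PERSONAS:
        block = personas.get(persona)
        if not isinstance(block, dict):
            problems.append(f"{persona}: missing persona block")
            continue
        for field in PERSONA_FIELDS:
            if not isinstance(block.get(field), list):
                problems.append(f"{persona}.{field}: expected a list")
    if not isinstance(report.get("top_actions"), list):
        problems.append("top_actions: expected a list")
    if not isinstance(report.get("severity_counts"), dict):
        problems.append("severity_counts: expected an object")
    return problems


def load_report(raw: str) -> Tuple[Optional[dict], List[str]]:
    """Parse raw critic output and validate it; returns (report, problems)."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return None, ["top level: expected an object"]
    return report, validate_report(report)
```

Rejecting a malformed report before scoring keeps a formatting failure from being miscounted as a missed issue.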
The project under review: CasaSol — a local-first AI property intelligence platform for Costa del Sol real estate. Full context in the CasaSol post.
The Numbers
| Metric | Claude Sonnet 4.6 | gemma4:26b |
|---|---|---|
| Runtime | inline | 84.7 s |
| Input tokens | ~4,750 (est) | 8,966 (exact) |
| Output tokens | ~1,170 (est) | 902 (exact) |
| Speed | — | 28.7 tok/s |
| Issues raised | 18 | 18 |
Token visibility asymmetry is itself a data point. Ollama returns exact counts from the API (prompt_eval_count, eval_count, eval_duration). Claude ran inline in this experiment, so its counts are estimated (context chars ÷ 4). When you run a local model through Ollama, you know exactly what you spent. When you run a model inline without usage metadata, you approximate. For a recurring QA gate, this matters.
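The two accounting methods can be put side by side. The sketch below assumes an Ollama response already parsed into a dict; the function names are illustrative. The field names are the ones listed above, and Ollama reports durations in nanoseconds.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the chars-divided-by-4 heuristic."""
    return round(len(text) / chars_per_token)


def ollama_usage(response: dict) -> dict:
    """Extract exact usage figures from a parsed Ollama response."""
    output_tokens = response["eval_count"]
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return {
        "input_tokens": response["prompt_eval_count"],
        "output_tokens": output_tokens,
        "tokens_per_second": output_tokens / seconds if seconds > 0 else 0.0,
    }
```

If the 28.7 tok/s figure was computed this way, 902 output tokens correspond to roughly 31 seconds of generation; the rest of the 84.7 s runtime would then be prompt evaluation and loading.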
Where They Agreed: The Compliance Layer
Both critics identified the same three compliance gaps, near-exactly:
| Issue | Claude | gemma4 |
|---|---|---|
| No DPA template for the agency-to-CasaSol relationship | ✓ | ✓ exact match |
| No DSAR (Data Subject Access Request) procedure | ✓ | ✓ exact match |
| No DPIA for the VLM-based extraction process | ✓ (via blur-not-implemented) | ✓ explicit |
On the legal and compliance surface, the 26B model running locally matched the frontier model in this run. The compliance gaps are findable by pattern-matching on what should be present but isn’t: document types that have known names and known requirements. The model doesn’t need to reason about the project; it needs to know what a compliant system looks like.
Where They Diverged: The Engineering Layer
Claude flagged the highest-severity engineering issue:
The VLM witnessing pipeline — described in the deck and the BRIEF as the primary moat component, the second pillar of competitive advantage alongside the filesystem firewall — does not exist in any commit. The corpus is built from text seed files.
gemma4 never found this. Instead, it reasoned about replicability: “The witnessing pipeline is essentially a manual OCR/VLM task that can be replicated with any high-end VLM API.” That is a valid concern. But it is a different question. gemma4 read the BUILD_LOG and BRIEF, accepted the claim that witnessing was implemented, and then critiqued it on competitive grounds.
Claude read the same BUILD_LOG claim (“witnessing reframe — image capture is core MVP”) and then cross-referenced it against the git log. No VLM commit exists. The claim is documented; the code is not. Claude flagged the gap between the two.
This is the distinction the experiment was designed to surface: pattern-matching on what is present versus detecting the gap between documented intent and implemented reality.
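The cross-reference itself is mechanically simple, which makes the miss more interesting. The sketch below is a deliberately crude stand-in and not how either critic worked: it flags any feature whose keywords appear in the documentation but in no commit subject. The function name, the feature map, and the keyword lists are all hypothetical.

```python
import re
from typing import Dict, List


def find_unbacked_claims(
    doc_text: str,
    commit_subjects: List[str],
    features: Dict[str, List[str]],
) -> List[str]:
    """Return features mentioned in the docs but in no commit subject."""

    def mentioned(text: str, keywords: List[str]) -> bool:
        # Case-insensitive match anchored at a word start.
        return any(
            re.search(rf"\b{re.escape(k)}", text, re.IGNORECASE) for k in keywords
        )

    commits = "\n".join(commit_subjects)
    return [
        name
        for name, keywords in features.items()
        if mentioned(doc_text, keywords) and not mentioned(commits, keywords)
    ]
```

Keyword matching produces its own false alarms (a feature can be committed under a different name), so a check like this would at most prompt a closer look, not replace the critic.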
The “False Positives” Are Not False
gemma4’s ~50% false-positive rate (issues it raised that Claude did not) sounds like noise. It isn’t. On review, most are valid gaps Claude simply didn’t explore:
- Filesystem firewall has no cryptographic or network-level isolation — it is a deployment choice, not a technical barrier
- Manual witnessing labor is hard to scale without headcount
- The Redactor’s PII leakage rate has not been formally measured (this is what exp_007 will cover)
- No hardware deployment cost model for a distributed agency network
gemma4 explored different territory. The high false-positive rate means two critics with different search strategies — not one critic being wrong.
The Verdict
| Metric | Result | Threshold | Pass? |
|---|---|---|---|
| Issue overlap rate (high-severity) | ~50% | ≥60% | FAIL |
| False positive rate | ~50% | ≤30% | FAIL |
Overall: FAIL as a drop-in replacement. Useful as a complementary signal.
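The two metrics can be computed from sets of issue labels. The sketch below makes its assumptions explicit: issues from both critics are assumed to have been normalised to shared labels beforehand (the matching step is not shown), overlap is measured against the reference critic's high-severity issues, and false positives are candidate issues the reference did not raise, as defined earlier. The thresholds are the ones in the table; the function name and return shape are illustrative.

```python
from typing import Set

OVERLAP_THRESHOLD = 0.60         # pass if overlap rate >= 60%
FALSE_POSITIVE_THRESHOLD = 0.30  # pass if false-positive rate <= 30%


def score(reference: Set[str], reference_high: Set[str], candidate: Set[str]) -> dict:
    """Score a candidate critic against a reference critic's issue set."""
    # Share of the reference's high-severity issues the candidate also raised.
    overlap = (
        len(reference_high & candidate) / len(reference_high) if reference_high else 0.0
    )
    # Share of the candidate's issues that the reference did not raise.
    false_positive = len(candidate - reference) / len(candidate) if candidate else 0.0
    overlap_pass = overlap >= OVERLAP_THRESHOLD
    fp_pass = false_positive <= FALSE_POSITIVE_THRESHOLD
    return {
        "overlap_rate": overlap,
        "false_positive_rate": false_positive,
        "overlap_pass": overlap_pass,
        "false_positive_pass": fp_pass,
        "overall_pass": overlap_pass and fp_pass,
    }
```

Note that "false positive" here only means "not raised by the reference", so a candidate that explores different but valid territory is penalised by construction.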
gemma4 catches compliance gaps cheaply and locally. A recurring run on every significant commit costs nothing: no API, no network, about 85 seconds on an M4 Pro. Claude caught the gap between documentation and implementation that gemma4 missed. That finding, the one that actually changes what you demo at a booth in eleven days, is the one that required something beyond pattern matching.
The Open Question
The compliance layer is findable by knowing what documents a compliant system requires. The implementation layer requires knowing what was claimed in the past and checking whether it was built. Is that memory? Reasoning? Context length? Something else?
We don’t have a clean answer. What we have is a reproducible setup, two committed artefacts, and a scoring log. The next run can change one variable — context window, model size, prompt design — and measure the delta.
That is what Chronos is for.
Evidence artefacts:
- HYPOTHESIS.md — pre-registered before execution
- claude_20260606_134500.json — Option A output
- gemma4_26b_20260606T131205.json — Option B output
- EXECUTION_LOG.md — scoring, observations, verdict
- critic.py — Option B script (run locally against any project)