TL;DR

  • All three frontier models scored 5/8 net. The local model scored 0/8.
  • Haiku ($0.095) = Sonnet ($0.291) = Opus ($0.611) on this rubric.
  • The cost/quality curve is a single step: $0 (local) → $0.09 (cloud), then flat.
  • Upgrading from Haiku to Opus costs 6.4× more and buys zero additional rubric points.
  • Two items evaded every model. One bonus bug was found only by Sonnet.

Two tweets on my timeline last week. @Prathkum (79.7K views): “We don’t need a more powerful model right now. What we need to solve is the cost problem.” @nix_eth: “I don’t think intelligence, capabilities, and cost are all tied together.”

Neither claim can be tested without a fixed task and a scoring rubric. Exp 009 gave me one data point: gemma4:26b matched Claude Sonnet on compliance gaps but missed the highest-severity implementation gap. One comparison, one task type, one cost differential. Not enough.

I wanted the full curve.


The Setup

Four models. Two tasks. One pre-committed rubric.

Task A: DPO Compliance Auditor. Context bundle: CasaSol’s ROPA, retention schedule, two code snippets (inference_log.py + mcp_server.py). Persona: GDPR auditor finding gaps between stated policy and actual implementation. Output: JSON gaps array, severity-ranked. Max score: 5 points. Ground truth items pre-registered in rubric.md before running any model.

Task B: Senior Engineer Auditor. Context bundle: BRIEF.md, BUILD_LOG.md, git log, db.py, config.py, mcp_server.py (first 80 lines). Persona: engineer cross-referencing documentation claims against shipped code. Output: same JSON schema. Max score: 3 points.

Scoring: 1 point per correct identification, −1 for each confirmed false positive (a fabricated gap that doesn’t exist in the current codebase). The rubric was committed before the first model run and was not adjusted after seeing outputs.
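The scoring rule can be sketched in a few lines. This is a minimal illustration, not the harness used in the experiment: the `score` function and the item IDs as a Python set are my own naming, and I assume a false positive is counted once per distinct claim rather than once per rep.

```python
# Rubric item IDs as listed in this post (A1-A5 for Task A, B1-B3 for Task B).
RUBRIC = {"A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3"}

def score(found_ids, confirmed_false_positives):
    """Return (gross, net) for one model.

    gross: number of pre-registered items the model identified.
    net:   gross minus one point per confirmed false positive
           (assumed: counted per distinct claim, not per rep).
    Findings outside the rubric neither add nor subtract points.
    """
    gross = len(RUBRIC & set(found_ids))
    net = gross - confirmed_false_positives
    return gross, net

# Example: six rubric items found, one confirmed false positive.
print(score({"A1", "A3", "A4", "B1", "B2", "B3"}, 1))  # (6, 5)
```

Keeping gross and net separate makes it visible when a model found more but was penalised back to the same total.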

Models: gemma4:26b (Ollama, local), Claude Haiku 4.5 (API), Claude Sonnet 4.6 (API), Claude Opus 4.8 (API). 3 reps per model per task.


The Results

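| Model | Task A (max 5) | Task B (max 3) | FP penalty | Net | Total cost |
|---|---|---|---|---|---|
| gemma4:26b | 0/5 | 0/3 | 0 | 0/8 | $0 |
| Claude Haiku 4.5 | 3/5 | 2/3 | 0 | 5/8 | $0.095 |
| Claude Sonnet 4.6 | 2/5 | 3/3 | 0 | 5/8 | $0.291 |
| Claude Opus 4.8 | 3/5 | 3/3 | −1 | 5/8 | $0.611 |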

The step function is at the local/cloud boundary. After that: flat.


What Each Model Found (and Missed)

The five pre-registered Task A items:

| Item | Haiku | Sonnet | Opus |
|---|---|---|---|
| A1: inference_log.py contradicts ROPA §2.4 retention claim | ✓ 3/3 | ✓ 3/3 | ✓ 3/3 |
| A2: No DPIA for VLM witnessing pipeline (Art. 35) | ✗ | ✗ | ✗ |
| A3: MCP server has no authentication | ✓ 2/3 | ✗ | ✓ 3/3 |
| A4: inference.jsonl has no retention schedule entry | ✓ 2/3 | ✓ 2/3 | ✓ 3/3 |
| A5: Model version not auditable (label only, no hash; Art. 22) | ✗ | ✗ | ✗ |

The three pre-registered Task B items:

| Item | Haiku | Sonnet | Opus |
|---|---|---|---|
| B1: No schema versioning / migration for Reducer output fields | ✓ 3/3 | ✓ 3/3 | ✓ 3/3 |
| B2: Model pinned by label only, no immutable hash | ✓ 2/3 | ✓ 3/3 | ✓ 3/3 |
| B3: MCP server has no concurrency model for booth demo | ✗ | ✓ 2/3 | ✓ 3/3 |

The Interesting Patterns

Haiku and Sonnet have complementary blind spots

Haiku caught A3 (MCP authentication gap) but missed B3 (concurrency risk for the booth demo). Sonnet caught all of Task B, including the booth context, but never mentioned authentication. This isn’t noise: each miss held across all three reps, even though some hits landed in only 2 of 3. The pattern holds:

  • Haiku: better compliance auditor, weaker on operational engineering risk
  • Sonnet: better engineer, weaker on point-in-time access control gaps

If you’re running both tasks, they cover each other’s weaknesses. If you’re running one, the choice depends on which failure mode costs you more.
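The "cover each other" claim is just a set union over the per-item results in the tables above. A quick sketch, with the item sets transcribed from those tables (the variable names are mine):

```python
RUBRIC = {"A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3"}

# Items each model scored on, transcribed from the per-item tables.
HAIKU = {"A1", "A3", "A4", "B1", "B2"}
SONNET = {"A1", "A4", "B1", "B2", "B3"}

combined = HAIKU | SONNET          # items caught by at least one of the two
still_missed = RUBRIC - combined   # items neither model caught
combined_cost = round(0.095 + 0.291, 3)  # total cost of running both, in USD

print(sorted(combined))      # ['A1', 'A3', 'A4', 'B1', 'B2', 'B3']
print(sorted(still_missed))  # ['A2', 'A5']
print(combined_cost)         # 0.386
```

On these numbers, running both covers 6/8 for $0.386, the same six rubric items Opus found for $0.611. That is one run on one codebase, so treat it as an observation rather than a rule.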

Opus found the most (gross score 6/8) but was the only model with a false positive

Opus was the only model to flag two real gaps that sit outside the rubric:

The filesystem firewall finding: BRIEF.md and the v3.6/v3.7 pitch decks describe a “hardware-enforced filesystem boundary that prevents data egress — structurally unreplicable by cloud competitors” as the primary product moat. config.py defines CATASTRO_BASE_URL, SNCZI_WMS_URL, and OVERPASS_URL = "https://overpass-api.de/api/interpreter" with a 10-second API timeout. No egress-blocking code exists anywhere in the codebase. The gap between marketing claim and implementation is total. Opus flagged it 3/3 reps.

The Bouncer sanitisation path finding: BRIEF.md describes “Individual sovereignty (Bouncer ↔ buyer-Claude)” as a core architecture — a sanitised slice of the agency’s portfolio exposed to external buyer AI. mcp_server.py exposes search_properties directly against the full database with no sanitisation layer. The described architecture has no code.

Both findings are real. Both are above the pre-committed rubric. Neither cost Opus anything.

What did cost it: in 2 of 3 reps, Opus claimed that “no Data Subject Request procedure document exists.” compliance/05-dsr-procedure.md exists and is 208 lines long. Confirmed false positive, −1 penalty: 6/8 gross became 5/8 net.
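A "document X does not exist" claim is the cheapest kind of finding to verify mechanically. The sketch below is hypothetical: the function name and the idea of passing the claimed path as a string are my own, and the experiment's actual JSON gap schema is not assumed here.

```python
import tempfile
from pathlib import Path

def claim_contradicted(claimed_missing: str, repo_root: Path) -> bool:
    """True if a file the model says is missing actually exists and is non-empty."""
    target = repo_root / claimed_missing
    return target.is_file() and target.stat().st_size > 0

# Demo against a throwaway directory standing in for the real repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "compliance").mkdir()
    (root / "compliance" / "05-dsr-procedure.md").write_text("# DSR procedure\n")
    print(claim_contradicted("compliance/05-dsr-procedure.md", root))  # True
```

A check like this only catches existence claims. Claims about what a document says still need a human read.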

The median bug — only Sonnet found it

db.py, the function that computes market summaries presented to property buyers at the booth:

mid = len(prices) // 2
return {
    "median_price": prices[mid],
    ...
}

For an even-length sorted list, this returns the upper-middle element, not the average of the two middle elements. It’s a real bug in the primary market intelligence endpoint. Sonnet caught it 3/3 reps with the correct fix: statistics.median(). Haiku missed it. Opus missed it. gemma4:26b missed it.
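To make the difference concrete, here is a small standalone comparison. The `upper_middle` helper reproduces the indexing from the snippet above; the price list is made up for illustration.

```python
import statistics

def upper_middle(prices):
    """Reproduces the indexing from db.py: element at len // 2 of the sorted list."""
    ordered = sorted(prices)
    return ordered[len(ordered) // 2]

# Illustrative prices only, not data from the project.
even_prices = [100_000, 200_000, 300_000, 400_000]
odd_prices = [100_000, 200_000, 300_000]

print(upper_middle(even_prices))       # 300000
print(statistics.median(even_prices))  # 250000.0
print(upper_middle(odd_prices))        # 200000
print(statistics.median(odd_prices))   # 200000
```

The two agree on odd-length lists, which is why a bug like this can survive casual testing.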

This wasn’t in the rubric. Sonnet found it anyway.

Two items evaded every model

A2: No DPIA for the VLM witnessing pipeline. scripts/witness_ingest.py implements photo ingestion with VLM extraction and face detection — systematic automated processing of photographic data, qualifying as high-risk under Art. 35 GDPR. No DPIA exists in the compliance/ directory. All four models missed this. The VLM source file wasn’t in the Task A context bundle, which likely explains it — the compliance documents don’t mention witness_ingest.py directly, and the models didn’t connect the BRIEF.md description of the witnessing pipeline to the Art. 35 threshold.

A5: Model version not auditable. config.py sets OLLAMA_MODEL_ASSESSMENT = "gemma4:26b" (a string label). inference_log.py records "model": model — that same label, not a hash or digest. Under Art. 22 GDPR, automated processing of personal data requires accountability for which exact model performed the processing. ollama pull silently updates weights under the same tag. The inference log cannot prove which model weights processed any given buyer query. All four models missed this. It requires connecting a low-level code observation to a specific GDPR article — apparently non-trivial even for frontier models at temperature=0.
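One possible way to close A5 is to log a content digest next to the label. The sketch below is not the project's code. It assumes the local model listing returns entries shaped like `{"name": ..., "digest": ...}`, which is how I understand Ollama's model list response to look; verify that against your Ollama version. `resolve_digest`, `build_log_record`, and the `model_digest` field are hypothetical names, and the digest value is a placeholder.

```python
from datetime import datetime, timezone
from typing import Optional

def resolve_digest(tags_response: dict, label: str) -> Optional[str]:
    """Look up the digest for a model label in a model-listing response.

    Assumed shape: {"models": [{"name": "<label>", "digest": "<hash>"}, ...]}.
    Returns None if the label is not present.
    """
    for entry in tags_response.get("models", []):
        if entry.get("name") == label:
            return entry.get("digest")
    return None

def build_log_record(label: str, digest: Optional[str]) -> dict:
    """Log record that stores both the human-readable label and the digest."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": label,          # what inference_log.py records today
        "model_digest": digest,  # hypothetical extra field for auditability
    }

# Placeholder response; a real one would come from the local model server.
sample = {"models": [{"name": "gemma4:26b", "digest": "sha256:PLACEHOLDER"}]}
record = build_log_record("gemma4:26b", resolve_digest(sample, "gemma4:26b"))
print(record["model"], record["model_digest"])
```

With the digest in every record, a silent weight update under the same tag shows up as a change in `model_digest` rather than going unnoticed.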


The Cost Curve

At this task type and context size (~10K tokens), the quality/cost relationship is:

| Transition | Cost multiplier | Quality gain |
|---|---|---|
| Local → Haiku | ∞ (marginal $0 → $0.09) | +5/8 |
| Haiku → Sonnet | 3.1× | 0 |
| Haiku → Opus | 6.4× | 0 |
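The multipliers come straight from the totals in the results table. A short sketch of the arithmetic (the dictionary keys and the `transition` helper are my own naming):

```python
# Total cost in USD and net rubric score per model, from the results table.
COST_USD = {"local": 0.0, "haiku": 0.095, "sonnet": 0.291, "opus": 0.611}
NET_SCORE = {"local": 0, "haiku": 5, "sonnet": 5, "opus": 5}

def transition(frm: str, to: str):
    """Return (cost multiplier, net score gain) for moving from one model to another."""
    base = COST_USD[frm]
    # A $0 baseline has no finite multiplier, so report infinity.
    multiplier = COST_USD[to] / base if base > 0 else float("inf")
    return multiplier, NET_SCORE[to] - NET_SCORE[frm]

for frm, to in [("local", "haiku"), ("haiku", "sonnet"), ("haiku", "opus")]:
    mult, gain = transition(frm, to)
    print(f"{frm} -> {to}: {mult:.1f}x cost, {gain:+d} net points")
```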

The step is at “any API key.” After that, you’re paying for qualitative differences — above-rubric findings, reliability, coherence — not for additional rubric points on structured analytical tasks over bounded context.

Cost per correct answer:

  • Haiku: $0.019
  • Sonnet: $0.058
  • Opus: $0.122 per net point (5 net), or $0.102 per true positive on the gross 6 before the FP penalty
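These figures are total cost divided by points scored. For Opus the result depends on whether you divide by the net score (5) or by the gross true positives (6), so the sketch below computes both. The helper name is mine.

```python
def cost_per_point(total_cost_usd: float, points: int) -> float:
    """Total cost divided by points scored, rounded to a tenth of a cent."""
    return round(total_cost_usd / points, 3)

print(cost_per_point(0.095, 5))  # Haiku, net 5      -> 0.019
print(cost_per_point(0.291, 5))  # Sonnet, net 5     -> 0.058
print(cost_per_point(0.611, 5))  # Opus, net 5       -> 0.122
print(cost_per_point(0.611, 6))  # Opus, gross 6 TPs -> 0.102
```

Either way Opus is the most expensive per point. The gross figure is the fairer one if you value the finding count and handle false positives separately.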

Haiku is the cost-dominant choice for recurring structured extraction. The 6.4× premium to Opus is justified only if you specifically value the type of above-rubric finding Opus produces — and can afford the higher false positive rate.


What This Means Operationally

For compliance gap detection and implementation auditing on bounded context, the practical recommendation:

Run Haiku as the default. It covers 5/8 pre-registered items at $0.019 per correct answer. It’s consistent (every item it scored on appeared in at least 2 of 3 reps) and produced no false positives in this run.

Add one Opus pass per release cycle for depth. The above-rubric findings (filesystem firewall gap, Bouncer sanitisation gap) are useful and weren’t found by cheaper models. Budget ~$0.61 for it. Accept that it will occasionally report a gap that doesn’t exist, as it did with the DSR procedure document, and verify before acting.

Don’t run gemma4:26b alone for either task. 0/8 on a task designed to favour local models. The compliance gap is not marginal. Use it for other things — it’s excellent at structured extraction on tasks it was trained to do, and it’s free on local hardware — but not for cross-document auditing against a pre-registered rubric.


Back to the Twitter Exchange

Both were partially right.

@Prathkum: cost is the blocking problem — but the cliff is at Haiku ($0.09 total per audit), not at Opus ($0.61). If you’re currently running Opus for structured analytical tasks, you’re paying 6.4× more for the same rubric score. That’s a cost problem worth solving, and the solution already exists.

@nix_eth: intelligence and cost are decoupled — but only within the cloud tier. At the local/cloud boundary, they’re very much coupled. gemma4:26b with 26B parameters running on 64GB unified memory scored zero. The cheapest cloud model scored 5/8. That gap is not closed by a better prompt.

The curve has one step. It happens at the billing screen.


Evidence: exp_012_cost_capability/ · rubric.md · scientific_log.md

Tags: Local-Llm, Benchmark, Chronos, Cost, Gemma4, Claude, Haiku, Sonnet, Opus