What Adam Smith, neuroscience, and a melting Mac Mini taught me about the real division of cognitive labour.
My Mac Mini was dying. Not dramatically — no smoke, no kernel panic. Just a quiet, 24-minute seizure: the fan screaming, and my Telegram bot silently refusing to answer “hello.”
I’m Miktam, a software engineer who’s spent the last few months building a local AI assistant on a Mac Mini instead of paying cloud APIs to think for me.
I run a local AI assistant called Nestor on an M4 Pro Mac Mini with 64GB of unified memory. The model is Gemma 4 26B A4B — a Mixture-of-Experts where only ~4B parameters fire per token, but all 26B of weights (about 21GB at Q6) stay resident in unified memory. That’s the MoE tradeoff: cheap compute per token, expensive to host. It handles my daily workflow — code reviews, notes, task management — entirely offline and private. All under my control.
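The arithmetic behind that tradeoff is worth seeing once. A rough sketch, where the 6.5 bits per weight is my assumption for Q6-style quantisation (the nominal 6 bits plus scale metadata):

# Back-of-the-envelope memory maths for a 26B-parameter MoE at Q6.
total_params = 26e9          # all weights must stay resident
active_params = 4e9          # experts that actually fire per token
bits_per_weight = 6.5        # assumed effective rate for Q6-style quantisation

resident_gb = total_params * bits_per_weight / 8 / 1e9
print(f"Resident in unified memory: ~{resident_gb:.0f} GB")        # ~21 GB
print(f"Doing work on any given token: ~{active_params / 1e9:.0f}B params")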
Gemma has a thinking mode — it reasons internally before answering, generating a hidden chain of thought, sometimes hundreds or thousands of tokens of deliberation, that you never see. Only the final answer surfaces. The idea is that more reasoning produces better answers.
On cloud hardware with 80GB A100s, thinking mode is a luxury you barely notice. On a Mac Mini, it’s a bomb with a slow fuse. I asked Nestor to “analyse all tasks and create lessons.” A reasonable request. The model started thinking. And thinking. And thinking. Twenty-four minutes later, the runner process had consumed 256 minutes of CPU time on a single request and still hadn’t produced a single visible token.
I spent four debugging sessions chasing this. I grepped config files for thinking triggers. I reduced the context window from 262K to 131K tokens. I restarted Ollama, cleared sessions, and killed zombie processes. Nothing worked — until I tested a direct API call with "think": false and got my answer in one second.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "analyse all tasks and create lessons",
  "think": false
}'
One second versus twenty-four minutes. Same model. Same weights. Same hardware. The only difference: I stopped asking it to think.
That fixed it immediately. It also made me realise I had been asking the wrong question.
The Assembly Line Worker Who Never Gets Dull
In 1776, Adam Smith opened The Wealth of Nations with an observation about a pin factory. A solo worker, he reckoned, “could scarce, perhaps, with his utmost industry, make one pin in a day, and certainly could not make twenty.” Ten workers splitting the job into narrow operations — drawing wire, straightening it, cutting it, pointing it, grinding the head — could produce 48,000.
Smith also saw the cost, though not in that opening chapter. Later, in Book V, he warned that a worker whose whole life is spent on “a few simple operations” eventually “becomes as stupid and ignorant as it is possible for a human creature to become.” The monotony that made the factory productive made the worker dull. That was the human price of industrial efficiency.
For 250 years, that tradeoff stood. Productivity demanded monotony. Monotony degraded humans.
Then we built something that thrives on monotony.
My local LLM is at its best when the task is narrow, repetitive, and well-defined. Scan this function for hardcoded secrets. Reformat this JSON. Classify this log entry. Extract dates from this paragraph. No ambiguity or judgment, no context beyond the immediate input. Monotonous work — the kind that would make a human dull.
The model doesn’t get dull or bored. It doesn’t lose focus after the four-hundredth log entry. It runs on 70 watts, doesn’t need coffee, and never resents the work.
The moment I ask it to do the opposite — synthesise, reason across a sprawling context, make judgment calls about what matters — I get the 24-minute runaway. I’m asking the assembly line worker to design the factory. Thinking mode doesn’t promote the worker to management. It gives them more time to stand at their station and stare at their hands.
Smith’s tragedy was that humans were the ones stuck on the line. An LLM that loves the line isn’t a tragedy. It’s the feature.
Your Brain Is Not a Context Window
Jeff Hawkins’ A Thousand Brains proposes that the neocortex stores memory as sparse distributed patterns across cortical columns — at any moment, maybe 2% of neurons in a region are active. Your memory of last Tuesday’s meeting is a thin activation pattern: the gist, the decision, the feeling. Everything else was never kept.
And that’s the key asymmetry. Biological memory is expensive to form and cheap to retrieve. Encoding requires attention, repetition, metabolic cost — your brain filters ruthlessly at write time. But once a memory is laid down as a sparse pattern, pulling it back is near-instant. You don’t re-read every email chain before replying. The gist is already there, compressed and ready.
Now look at how a local LLM manages context.
My Ollama setup replays the entire conversation history on every single inference. Every turn — system prompt, tool definitions, my message, the model’s full response, thinking tokens, all of it — appended to a growing log and re-processed from scratch. Nothing is compressed. Nothing is filtered. Nothing fades.
By hour four of a session, my “hello” is dragging 100K tokens of uncompressed history through the KV cache. The prompt evaluation phase — where the model processes all input tokens — takes minutes because it’s re-computing attention over everything, including tokens that stopped being relevant three hours ago.
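Stripped of the tool plumbing, the pattern looks roughly like this (a sketch, not Nestor's actual code, but the shape is the same):

import requests

history = []  # every turn ever: never compressed, never filtered, never fades

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Appending is free. The cost comes here: the ENTIRE history is re-sent
    # and re-evaluated by the model on every single call.
    r = requests.post("http://localhost:11434/api/chat", json={
        "model": "gemma4:26b",
        "messages": history,
        "stream": False,
    })
    reply = r.json()["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply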
The inversion is precise:
Your brain is expensive to write, cheap to read. High encoding cost, near-instant retrieval of compressed patterns.
Your LLM context is cheap to write, expensive to read. Zero cost to append a new turn, escalating cost to re-process the entire history on every query.
On a Mac Mini with 64GB of unified memory, that inversion is the difference between a one-second response and a twenty-four-minute runaway.
The Benchmark Proof
I built a benchmark script to measure the cost directly. First test: send the same prompt to Gemma 4 26B at increasing context window sizes — from 4K up to 130K tokens — and measure prompt evaluation and generation speed.
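My actual script has more plumbing, but the core loop is roughly this sketch (the prompt is a stand-in; num_ctx is the Ollama option that sets the context window, and the token counts and durations come back in the response when streaming is off):

import requests

PROMPT = "Explain how a KV cache works in transformer inference."

for num_ctx in (4096, 8192, 16384, 32768, 65536, 131072):
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma4:26b",
        "prompt": PROMPT,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).json()
    # Ollama reports durations in nanoseconds.
    prompt_tps = r["prompt_eval_count"] / r["prompt_eval_duration"] * 1e9
    gen_tps = r["eval_count"] / r["eval_duration"] * 1e9
    print(f"{num_ctx:>7}: prompt eval {prompt_tps:.0f} t/s, generation {gen_tps:.0f} t/s")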
Phase 1: Context Window vs Performance
| Context Window | Avg Prompt Eval (t/s) | Avg Generation (t/s) |
|---|---|---|
| 4K | 204 | 41.8 |
| 8K | 355 | 41.4 |
| 16K | 382 | 41.4 |
| 32K | 447 | 41.6 |
| 64K | 422 | 41.6 |
| 130K | 388 | 41.4 |
The surprise: there’s no cliff. Generation speed holds rock-solid at ~41 tokens per second from 4K to 130K. The Metal GPU on Apple Silicon handles the full KV cache without breaking a sweat. Context size is innocent.
So where do the twenty-four-minute runaways come from?
Phase 1b: The Thinking Tax
I ran the same model on five prompts (three direct, two analytical), each with thinking mode on and off. Same context window and hardware. The only variable is whether the model reasons internally before answering.
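The harness is nothing fancy. A minimal sketch, assuming a recent Ollama build that returns the hidden reasoning in a separate thinking field:

import requests
import time

def run(prompt: str, think: bool) -> None:
    start = time.time()
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "gemma4:26b",
        "prompt": prompt,
        "think": think,          # top-level flag, not an "options" entry
        "stream": False,
    }).json()
    thinking_chars = len(r.get("thinking", ""))  # hidden reasoning, if the build returns it
    print(f"think={think}: {r['eval_count']} tokens, "
          f"{time.time() - start:.0f}s, {thinking_chars} thinking chars")

prompt = "Convert this JSON to a markdown table: ..."  # placeholder prompt
run(prompt, think=False)
run(prompt, think=True)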
| Test | Type | Think | Avg Tokens | Avg Time | Thinking Chars |
|---|---|---|---|---|---|
| JSON→table | direct | off | 31 | 1s | 0 |
| JSON→table | direct | on | 451 | 11s | 1,357 |
| OWASP list | direct | off | 82 | 3s | 0 |
| OWASP list | direct | on | 459 | 11s | 1,303 |
| KV cache explainer | direct | off | 161 | 4s | 0 |
| KV cache explainer | direct | on | 763 | 19s | 2,865 |
| OWASP analysis | analytical | off | 1,150 | 29s | 0 |
| OWASP analysis | analytical | on | 2,253 | 60s | 3,585 |
| RBAC design | analytical | off | 1,349 | 35s | 0 |
| RBAC design | analytical | on | 2,274 | 61s | 3,924 |
The JSON-to-table test is the clearest indictment. The model generated 1,357 characters of hidden reasoning to produce a 76-character markdown table. Nearly eighteen times more thinking than output. Eleven seconds instead of one. And the visible result was identical.
On simple tasks, thinking mode adds 5-15x token overhead for zero improvement. On complex tasks, it adds roughly 2x — arguable value at best.
Now extrapolate. These were clean, single-turn benchmarks. In production, my assistant carries system prompt, tool definitions, and accumulated conversation history. An ambiguous prompt like “analyse all tasks and create lessons” in a full session context doesn’t generate 2,000 thinking tokens — it generates 10,000 to 25,000. At 38 tokens per second, that’s four to eleven minutes of the GPU grinding on reasoning you’ll never see.
That’s not a performance problem. That’s an architectural mistake.
“But the Algorithms Will Handle Memory”
There’s a reasonable counterargument: memory management is an engineering problem, and engineering problems get solved. The current generation of memory plugins for LLM frameworks uses three approaches.
A sliding window keeps the last N turns and discards the rest. Simple and fast, but dumb. The architectural decision from three hours ago that’s suddenly relevant again? Gone. Sliding windows solve the memory cost problem by amputating memory itself.
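The whole mechanism fits in a few lines, which is both its appeal and its weakness. A sketch:

MAX_TURNS = 20  # keep only the most recent exchanges

def windowed(history: list[dict]) -> list[dict]:
    # The system prompt survives; everything older than MAX_TURNS is simply gone,
    # including the decision from three hours ago that just became relevant again.
    system, turns = history[:1], history[1:]
    return system + turns[-MAX_TURNS:]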
Summarisation periodically compresses old turns into a condensed summary that stays in context. This is closer to how brains work — lossy compression with gist preservation. But the summariser is another LLM call. And who evaluates what’s important enough to keep? The same model that can’t handle 700KB of JSON without melting. You’re asking the assembly line worker to decide which parts of the factory to shut down.
RAG-backed memory — Retrieval-Augmented Generation — embeds old conversations into a vector database and retrieves relevant chunks at query time. This is the most brain-like approach: sparse, retrieval-based, and attention-driven. But embedding quality on local models is lossy, retrieval adds latency, and the system still can’t reason across fragments. It finds similar text. Similarity is not relevance.
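Stripped to its core, the retrieval step looks like the sketch below (nomic-embed-text is just an example embedding model; the ranking is plain cosine similarity, which is exactly why it surfaces similar text rather than relevant text):

import requests

def embed(text: str) -> list[float]:
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

def retrieve(query: str, memory: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    q = embed(query)
    # Ranks past chunks by textual similarity alone; nothing here knows what actually matters.
    ranked = sorted(memory, key=lambda chunk: cosine(q, chunk[1]), reverse=True)
    return [text for text, _ in ranked[:k]]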
None of these solves the problem. They mitigate it. And they all share a fundamental assumption: that the model can evaluate relevance. That the junior dev can be their own manager.
But Human Thinking Isn’t the Answer Either
Before I get too comfortable on my high horse, I should admit that human cognition has its own failure modes.
That 700KB JSON file that crashed my system? A senior engineer can analyse it — but not by holding it all in working memory. They open it, scan the structure, grep for patterns, use jq to filter, and build a mental model iteratively over minutes or hours. They use tools. The intelligence is in knowing which tool to reach for, what to search for, and when they’ve found enough — not in holding it all in their head.
Now give the same file and the same tools to a junior developer. They open it, scroll aimlessly, grep for the wrong patterns, and get lost in nested objects. The tools don’t help because they don’t know what question to ask. The tools are the same. The judgment is missing.
Local LLMs sit in exactly this position: a junior developer with incredible tools and no judgment.
Thinking mode isn’t judgment. It’s just more cycles for the model to wander. RAG gives you a search box without teaching you how to search. Memory plugins give you a filing system without teaching you what’s worth filing. All three solve the wrong problem — they assume the model has the judgment to use the tool, when the missing thing is the judgment.
The reality here is annoying for everyone. “Algorithms will handle it” is premature — the junior dev can’t manage themselves. But “human thinking is always superior” is also wrong. I can’t analyse 700KB of JSON in my head any more than Gemma 4 can. I just know which 2KB of it matters.
The Division of Cognitive Labour
So where does this leave us?
Adam Smith showed us that dividing labour into narrow operations creates enormous productivity — at the cost of degrading the humans who perform them. Jeff Hawkins showed us that biological intelligence manages memory through expensive encoding, sparse storage, and relevance-based retrieval — the opposite of how LLMs work. And my melting Mac Mini showed me that asking a local model to simulate human reasoning is the worst possible use of constrained hardware.
The synthesis is a division of cognitive labour — and the split is the one Smith would recognise. You do the judgment work: deciding what deserves attention, decomposing ambiguous problems into concrete operations, carrying the gist of what’s happened and what matters next. The model does the assembly-line work: scan this function, reformat this output, classify this entry, generate this template. Monotonous work, performed at machine speed, without the human cost that troubled Adam Smith.
When I stopped asking Nestor to think and started giving it well-scoped tasks with just enough context, response times dropped from minutes to seconds — and the quality went up, because a precise question with a 4K context window produces a better answer than a vague question with 130K tokens of accumulated noise.
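Concretely, a well-scoped request now looks something like this (the file name is a stand-in; the point is the narrow prompt, thinking off, and a deliberately small context window):

import requests

function_body = open("auth_handler.py").read()   # hypothetical file under review

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma4:26b",
    "prompt": "Scan this function for hardcoded secrets and list them:\n\n" + function_body,
    "think": False,                   # no hidden deliberation
    "stream": False,
    "options": {"num_ctx": 4096},     # just enough context, nothing more
})
print(r.json()["response"])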
The constraints of local hardware — limited memory, limited compute, no cloud safety net — force you to solve the actual problem that cloud APIs let you ignore: what does intelligence need to remember, and what kind of work should each type of intelligence actually do?
The answer is not a better context window. It’s realising that my Mac Mini is a pin factory, and I’m the one who needs to run it.
The benchmark scripts and data are available on GitHub.
This post is part of Local First AI — a series about running production AI systems on local hardware with zero cloud dependency.