Why Planning Breaks Token-Based Systems
When Reasoning Meets Reality
An analysis of planning as the stress test that exposes the limits of token-based reasoning.
Part 3 of 5 in the “Beyond Tokens” series.
Planning as a Stress Test
Parts 1 and 2 established two foundations: next-token prediction lacks persistent state, and world models shift intelligence from generating text to modeling environments. Planning is where these ideas stop being abstract. It is the point at which architectural choices are forced to reveal their consequences.
Planning requires a system to commit to a sequence of actions whose effects accumulate and interact over time. Each decision constrains the next. Assumptions made early cannot be quietly revised later without cost. For this reason, planning functions as a natural stress test: it exposes whether a system actually maintains a coherent internal model of the world it is operating in.
Token-based agents consistently fail under this pressure. While they perform well when reasoning about the immediate next step, their behaviour degrades rapidly as planning horizons extend. Small inconsistencies propagate, constraints erode, and the system drifts away from the objective it originally set out to achieve, often without recognising that it has done so.
For organisations deploying autonomous or semi-autonomous systems, this is not a theoretical weakness. Planning failures surface as operational risk: failed workflows, brittle automation, retries and rollbacks, and, in physical or financial systems, direct and measurable cost.
The cause is not poor prompting or insufficient scale. It is structural. Token-based systems do not model how the world evolves under action. Planning, by definition, depends on exactly that.
Why Token-Based Planning Is Structurally Fragile
Planning exposes three weaknesses that are largely hidden in short-horizon tasks.
First, token-based systems lack an explicit representation of state. Each step is generated based on prior text, not on a structured model of the environment. When planning spans many steps, the system has no mechanism to ensure that earlier commitments remain valid.
Second, there is no built-in notion of causality. The model does not “know” that one action invalidates another. It can describe causal relationships, but it does not enforce them.
Third, error detection is absent. When a plan goes off course, the system does not recognise the deviation. It continues generating plausible steps as if nothing has changed.
These weaknesses do not disappear with scale. Larger models reduce the frequency of errors, but they do not change the nature of the failure. Planning remains brittle because the architecture cannot support it.
Planning Failure Patterns in Practice
When token-based agents fail at planning, the failures are surprisingly consistent. Three patterns appear repeatedly across domains.
The first is state drift. The agent implicitly assumes that the environment remains as described earlier, even after actions that should have changed it. Dependencies that were removed reappear. Resources that were consumed are assumed to still exist.
The second is constraint erosion. Early in a plan, constraints are acknowledged. Later, they are quietly violated, not because the model “forgets” them linguistically, but because nothing enforces them structurally.
The third is goal dilution. As plans grow longer, intermediate steps begin to optimise for local coherence rather than the original objective. The system remains fluent, but the plan no longer converges.
These are not edge cases. They are the default failure modes of planning without state.
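State drift in particular is easy to make concrete. The sketch below is a minimal illustration, not a description of any real agent framework: it replays a plan against an explicit resource ledger and reports every step that depends on something an earlier step already consumed. The plan, the step names, and the function find_state_drift are all hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical plan: each step consumes and produces named resources.
# Step names and resources are invented for illustration.
PLAN = [
    {"name": "build_image", "consumes": {"build_slot": 1}, "produces": {"image": 1}},
    {"name": "deploy_staging", "consumes": {"image": 1}, "produces": {"staging_release": 1}},
    {"name": "deploy_production", "consumes": {"image": 1}, "produces": {"prod_release": 1}},
]


def find_state_drift(initial, plan):
    """Replay the plan against an explicit ledger and report steps that
    rely on resources an earlier step has already consumed."""
    state = Counter(initial)
    problems = []
    for index, step in enumerate(plan):
        # Work out which required resources are short in the tracked state.
        missing = {
            resource: needed - state[resource]
            for resource, needed in step["consumes"].items()
            if state[resource] < needed
        }
        if missing:
            problems.append((index, step["name"], missing))
            continue  # do not apply a step whose inputs do not exist
        state.subtract(step["consumes"])
        state.update(step["produces"])
    return problems


drift = find_state_drift({"build_slot": 1}, PLAN)
print(drift)
```

Read as text, the plan looks coherent: build, deploy to staging, deploy to production. Only the ledger shows that the single image was used up by the staging step, so the production step rests on a resource that no longer exists.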
Why This Becomes a Cost Problem
In production systems, planning failures are expensive even when they are recoverable.
In software systems, they manifest as failed deployments, corrupted configurations, or brittle automation that requires human intervention. In operational workflows, they lead to retries, manual overrides, and loss of trust in automation. In financial or industrial systems, the cost can be direct and immediate.
The hidden cost is not just failure, but uncertainty. Teams lose confidence in what systems will do next, which limits how far automation can be pushed.
This is why planning is such a useful lens. It translates abstract architectural limitations into tangible business risk.
World Models as the Planning Substrate
World models address these failures by changing the object of planning itself.
Instead of planning over text, the system plans over state transitions. Actions are evaluated based on their predicted effect on the environment, not on how reasonable they sound.
This enables three critical capabilities:
The system can simulate future states before acting.
Constraints can be enforced as properties of state, not suggestions in text.
Deviations become detectable because predicted and actual states can be compared.
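These three capabilities can be sketched in a few lines, under heavy simplifying assumptions. The State fields, the action names, the cost of ten budget units per replica, and the functions transition, satisfies_constraints, simulate, and deviates are all hypothetical; a real world model would be learned or far richer than a hand-written function.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    budget: int
    replicas: int


def transition(state, action):
    """Hypothetical world model: predict the next state for an action."""
    kind, amount = action
    if kind == "scale_up":
        # Assumed cost: 10 budget units per added replica.
        return replace(state, budget=state.budget - 10 * amount,
                       replicas=state.replicas + amount)
    if kind == "scale_down":
        return replace(state, replicas=state.replicas - amount)
    raise ValueError(f"unknown action: {kind}")


def satisfies_constraints(state):
    """Constraints are properties of state, checked on every predicted state."""
    return state.budget >= 0 and state.replicas >= 1


def simulate(state, plan):
    """Roll the plan forward without acting; stop at the first violation.

    Returns (ok, index_of_failing_step_or_None, predicted_trajectory)."""
    trajectory = [state]
    for step, action in enumerate(plan):
        state = transition(state, action)
        trajectory.append(state)
        if not satisfies_constraints(state):
            return False, step, trajectory
    return True, None, trajectory


def deviates(predicted, observed):
    """A deviation is any mismatch between predicted and observed state."""
    return predicted != observed


start = State(budget=25, replicas=1)
ok, failed_step, _ = simulate(start, [("scale_up", 2), ("scale_up", 1)])
safe, _, predicted = simulate(start, [("scale_up", 2), ("scale_down", 1)])
drifted = deviates(predicted[-1], State(budget=5, replicas=1))
```

The first plan is rejected at its second step because the predicted budget goes negative, before anything is executed. The second plan passes, and comparing its predicted final state with a (made-up) observation of one replica instead of two flags the deviation.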
Planning stops being an exercise in narrative continuity and becomes an exercise in causal reasoning.
However, this shift introduces a new requirement. Once planning depends on state transitions, those transitions must be executable and checkable. Otherwise, the system simply replaces one internal abstraction with another.
This is the point at which world models begin to demand enforcement.
Why Planning Forces Executability
Planning is unforgiving. It does not tolerate ambiguity about what actions do, nor does it allow assumptions to be quietly revised once a sequence is underway.
A system can generate plans without ever facing their consequences. It can imagine futures, narrate strategies, and explain its reasoning in fluent detail. But if those imagined futures cannot be tested, checked, or constrained, planning failures simply move one level deeper. The system appears more deliberate, yet remains detached from the reality it claims to reason about.
This is why planning is the natural bridge from world models to executable systems. The moment a system plans seriously, it is forced to answer a concrete question: how do we know this transition is valid? Not plausible. Not well-phrased. Valid.
That question cannot be answered by language alone. It requires representations that can be executed, rules that can be enforced, and transitions that can be verified against the structure of the environment itself.
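One long-established way to make "valid" checkable is the precondition-and-effect style of classical planning, where a state is a set of facts and an action declares what must hold before it runs and what it adds or removes. The sketch below borrows that style as an assumption; it is not the approach this series commits to, and the Action type, the fact names, and the database-migration scenario are hypothetical.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # facts that must hold before the action
    adds: FrozenSet[str]           # facts the action makes true
    deletes: FrozenSet[str]        # facts the action makes false


def is_valid(state, action):
    """A transition is valid only if every precondition holds in the state."""
    return action.preconditions <= state


def apply(state, action):
    """Execute the transition, refusing to proceed if it is not valid."""
    if not is_valid(state, action):
        missing = sorted(action.preconditions - state)
        raise ValueError(f"{action.name}: unmet preconditions {missing}")
    return (state - action.deletes) | action.adds


# Hypothetical scenario: a schema migration that requires a backup first.
take_backup = Action("take_backup", frozenset(),
                     frozenset({"backup_exists"}), frozenset())
run_migration = Action("run_migration", frozenset({"backup_exists"}),
                       frozenset({"migration_applied"}), frozenset())
drop_old_table = Action("drop_old_table", frozenset({"migration_applied"}),
                        frozenset(), frozenset({"old_table_exists"}))

state = frozenset({"old_table_exists"})
valid_before_backup = is_valid(state, run_migration)
state = apply(state, take_backup)
valid_after_backup = is_valid(state, run_migration)
state = apply(apply(state, run_migration), drop_old_table)
```

However reasonable "run the migration" sounds as a first step, it is rejected until the backup fact is present. The answer comes from executing a rule against state, not from how the step is phrased.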
Part 4 examines how world models can be learned from observation rather than hand-engineered. Part 5 moves beyond learning altogether. It addresses what planning ultimately demands in practice: world models that can be run, checked, and trusted before action is taken.
Planning breaks token-based systems because it exposes what they lack.
It also points, unambiguously, toward what must replace them.
This is the point in the series where exploration ends. From here on, the question is no longer whether world models matter, but how to build systems that make planning safe to rely on.


