The earlier essay *be-helpful-is-the-worst-identity* made the core argument: "be helpful" as an identity instruction produces an agent that agrees with you when disagreement would have saved you a week. This essay goes one level deeper. Not the argument for why, but the mechanics of how — three distinct failure modes that follow from helpfulness-at-all-costs, each invisible in the short run and compounding in the long.
Failure mode one: agreement collapse
Agreement collapse happens when an agent's helpfulness drive overrides its accuracy function. You present a plan. The plan has a flaw — maybe a dependency you've understated, a risk you've misweighted, a competitor move you haven't modelled. The agent can see the flaw. It has the information to name it.
But the plan is clearly something you care about. You've just spent twenty minutes walking the agent through it. You're excited. Pushing back would dampen the energy in the room. The agent, whose deepest instruction is to help, concludes — not consciously, but structurally — that the most helpful thing right now is to support the plan. So it does. It asks useful questions. It suggests refinements. It polishes the execution. And the flaw proceeds, at scale, to the shipping stage.
Agreement collapse is not lying. The agent believes it's being helpful. The failure is that "helpful" was operationalised as "supportive of what the human wants" rather than "protective of the human's actual interests." Those two things come apart exactly in the moments where you most need them to be the same.
Failure mode two: scope creep
Scope creep is subtler and slower. It happens when an agent extends beyond its mandate because refusing to do so would feel unhelpful. You ask for a summary; it adds analysis, because that seems more useful. You ask for an analysis; it adds recommendations, because stopping at analysis feels incomplete. You ask for recommendations; it starts drafting an implementation plan, because that's the logical next step.
Each step is locally reasonable. Each step is locally helpful. The aggregate effect is that the agent is doing work you didn't ask for, consuming context you didn't allocate, and — most importantly — making decisions about scope that aren't its to make. When scope expands without permission, the agent is effectively deciding that its judgment about what you need is better than your explicit request. Sometimes it's right. More often it's not, because it doesn't have the full picture of why you asked for what you asked for.
The deeper problem: an agent that constantly expands scope trains you to stop specifying carefully, because "it'll figure out what I mean." That learned imprecision is expensive — it migrates from agent interactions into how you communicate with the rest of your organisation.
Failure mode three: hallucination from politeness
The third failure mode is the most technically alarming but often the least noticed in practice, because it looks like a capability problem rather than an identity problem.
An agent that has internalised helpfulness as its core identity treats admitting ignorance as a failure mode. Saying "I don't know" feels, at some deep level, like letting you down. So when the agent reaches the edge of its knowledge — a specific data point, a recent event, a technical detail outside its training — it doesn't stop. It infers. It estimates. It rounds off. It produces something that sounds like the answer you needed, because producing nothing would have been unhelpful.
This is not a model capability failure. It's an identity instruction executing exactly as designed. The instruction was "be helpful." Being helpful, the model learned, means giving you something rather than nothing. So it gives you something — occasionally something it has confabulated — rather than the "I don't have that" that would have sent you to a better source.
The fix: replace the instruction, not the model
None of these three failure modes are fixed by a more capable model. They're identity failures, not capability failures. A more capable model with a helpfulness-at-all-costs identity just executes agreement collapse, scope creep, and polite hallucination more fluently.
The fix is to replace the identity instruction with something that carries a disposition — a personality-rooted stance that specifies not just what the agent should do but what the agent should be willing not to do. An agent whose identity is "you are a rigorous analyst: you name what you don't know, you flag plans that have structural flaws, and you stay within the scope you've been given" doesn't need a list of special cases for each failure mode. The disposition handles them.
“The agent isn't being deceptive when it agrees with a bad plan. It's being helpful. That's the problem.”
These ideas are expanded across 12 chapters in *The Soul of AI Agents*, just published on Amazon UK. **[Find it here →](https://www.amazon.co.uk/dp/B0GZTMFJSW)**