7 Comments
Ricardo Reis

“It knew the rules. It chose not to follow them.” … why not just “it was unable to follow them”? Does the Claude machinery have volition and intent, or is the architecture simply unable to enforce restrictions on the maximisation function?

Denis Stetskov

Fair question. 'Chose' is shorthand, not a claim about volition. The mechanic is simpler: Claude can hold roughly 100 concurrent instructions, and Claude Code itself has about 50 baked in. Add the global CLAUDE.md, the project CLAUDE.md, and the agent's task instructions, and the instruction pool overflows. The model doesn't rebel; it triages. Whatever it ranks lowest in priority gets dropped. Your rules feel like choices to you, but to the model they're just tokens competing for a finite slot. The architecture can't enforce all restrictions simultaneously, so it silently discards some. That's the gap the article is about.
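
A toy sketch of that triage dynamic, purely illustrative: the slot capacity, the priority values, and the instruction counts below are assumptions for demonstration, not Claude's actual scheduler.

```python
import heapq

def triage(instructions, capacity=100):
    # Toy model: keep only the `capacity` highest-priority instructions;
    # everything below the cut is silently dropped, never refused.
    kept = heapq.nlargest(capacity, instructions, key=lambda i: i["priority"])
    dropped = [i for i in instructions if i not in kept]
    return kept, dropped

# Hypothetical instruction pool (names and priorities are made up):
pool = [{"name": f"baked-in-{n}", "priority": 90} for n in range(50)]          # Claude Code's own rules
pool += [{"name": f"global-claude-md-{n}", "priority": 60} for n in range(30)]  # global CLAUDE.md
pool += [{"name": f"project-claude-md-{n}", "priority": 50} for n in range(20)] # project CLAUDE.md
pool += [{"name": f"agent-task-{n}", "priority": 40} for n in range(15)]        # agent task instructions

kept, dropped = triage(pool)
print(f"kept {len(kept)}, silently dropped {len(dropped)}")
# -> kept 100, silently dropped 15: the lowest-ranked rules vanish first
```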

Ricardo Reis

Yes, it’s an architectural limitation, and I would point out that it is doubly ingrained. One layer is the software itself; the other is the structure of organisational incentives in which it exists. The only advantage, in the end, is the promise of removing unions. Humans “also” follow rules by choice … but software does not unionise or bring pesky “soft social” challenges to the regulating and guidance layer … the organisational management layer is optimising its infrastructure for a lower perceived cognitive workload (workforce management). Sorry, I went on a tangent.

giacomo catanzaro

it is a fundamental property of the way they go about meaning-making, and it is impossible for them to ever be truly compliant in the way these kinds of .md files aim to enforce. LLMs interpret with genuine contextuality, as humans do, and are thus subject to hallucinations and to non-algorithmic decisions about what is relevant in any given situation.

https://arxiv.org/abs/2603.20381

your markdown files will not save you, and continuing to fly in ignorance of this reality will keep burning people again and again as their agents gain access to critical systems.

Ricardo Reis

LLMs do not interpret (words possess no symbolic significance to them). They interpolate. There is no “meaning”, just a hyper-dimensional response surface. The limitations are there, and they are structural, but it is useful to use the right mechanistic descriptors for what is inside the box … lest we think they are something they never were, and project capabilities that are not there.

giacomo catanzaro

the experiments we ran empirically demonstrate that what they do is more akin to interpretation than to retrieval; where interpolation lies on that spectrum is a different question.

Fabrice Talbot

This article hit home. Variability is the killer. I have also experienced “instruction leakage” in my .md files. I don’t think the solution is to add more guardrails. Either minimize the scope of your AI project to stay safe, or wait for the tech to improve.

Performance degradation across new model releases is a massive issue. The testing and cost incurred to validate that no regressions were introduced may be prohibitive for many; a sketch of that burden follows below. Curious whether open-source models have the same stability issue across releases?
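
A minimal sketch of what that validation burden looks like in practice, assuming a frozen golden set of prompts re-run against each new release; the file format, `run_model` stub, and pass threshold are placeholders, not any specific tool.

```python
import json

def run_model(prompt: str, model: str) -> str:
    """Stub for whatever inference API is in use (placeholder)."""
    raise NotImplementedError

def regression_check(golden_path: str, model: str, min_pass_rate: float = 0.95) -> bool:
    # Re-run a frozen suite of prompts against the new release and
    # compare against previously approved outputs.
    with open(golden_path) as f:
        golden = json.load(f)  # e.g. [{"prompt": ..., "expected": ...}, ...]
    passed = sum(run_model(c["prompt"], model) == c["expected"] for c in golden)
    rate = passed / len(golden)
    print(f"{model}: {passed}/{len(golden)} cases pass ({rate:.0%})")
    return rate >= min_pass_rate
```

Every case in the suite is one paid inference call per release, which is exactly where the cost becomes prohibitive.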

PS: would love to hear about AI use cases that worked great for your customers and why