Every few weeks a team shows us a new system prompt. It is longer than the last one, full of capital letters and the word NEVER. They are sure this version finally stops the model from leaking data or calling the wrong tool. It does not, and it never will, because they are solving the wrong problem.
A language model treats every token it reads as input. It does not have a separate, privileged channel for instructions and a lower-class channel for data. When your application pastes a web page, a support ticket, or a PDF into the context, the model reads the attacker’s text with exactly the same trust it gives your own rules.
That means an instruction buried in retrieved content can override the one you wrote, no matter how forcefully you wrote it. You are not in an argument the model can referee. You handed both sides the same microphone.
If untrusted text and trusted instructions share one context, you have already lost. The only question is how much.
The durable fixes are structural, and they sit outside the prompt:
When we red-team an AI system, we do not grade the system prompt. We map the trust boundaries, then attack across them: indirect injection through retrieved documents, tool-call hijacking, and data exfiltration through the model’s own outputs. The findings we hand back are architectural, because that is where the fix has to happen.
Better wording buys you a day. Better structure buys you the year.