You’ve built a RAG system. You’re injecting relevant docs into your prompts. Everything’s working great—until your AI says:
“From your recall snippet, you already have XYZ available.”
And your user thinks: What the hell is a recall snippet?
This is the fourth wall problem. Your AI is exposing infrastructure that users shouldn’t see.
Why This Happens
LLMs don’t naturally distinguish between:
- User-provided context (“You told me X”)
- System-injected context (“The system gave me X”)
- Native knowledge (“I know X”)
To the model, it’s all just tokens in the prompt. And thanks to RLHF training that rewards transparency and source attribution, the model wants to cite where it got information.
So it does. To users who have no idea what it’s talking about.
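To make that collapse concrete, here's a minimal sketch (the strings, tag name, and variable names are illustrative, not from any real system) of how all three sources end up in one flat prompt:

```python
# Three different "sources" of information, one undifferentiated prompt.
user_message = "How do I export my data?"              # user-provided
retrieved_doc = "Exports live under Settings > Data."  # system-injected (RAG)
# Native knowledge lives in the model weights; nothing marks any boundary.

prompt = (
    "System: You are a helpful assistant.\n"
    f"<xyz_recall>{retrieved_doc}</xyz_recall>\n"
    f"User: {user_message}\n"
)
# The model sees a single token sequence. Without explicit instructions,
# it has no way to know the user never saw the <xyz_recall> block.
```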
What Doesn’t Work
I tried a bunch of things before landing on what actually works:
| Approach | Why It Failed |
|---|---|
| Vague instructions (“never mention recall”) | Too weak against the citation instinct |
| Consecutive assistant messages | API constraints (Anthropic requires alternation) |
| Fake tool calls | Orphaned tool_use without definitions confuses the model |
| Assistant prefill | Adds complexity, only partial fix |
The problem is that positive framing alone isn’t enough. “Present information as your own knowledge” sounds clear to us, but the model’s citation training overrides it.
What Actually Works
You need three things working together:
1. Epistemological Framing
Tell the model exactly what the injected content is:
```
<xyz_recall> blocks are system-injected context:
1. Retrieved — System searched based on user's query
2. System-injected — User didn't provide it, can't see it
3. Reasoning aid — Exists to help you answer accurately
4. Internal only — Use AS knowledge, not ABOUT knowledge
```
This gives the model a mental model for what it’s receiving.
2. Fourth Wall Rule
Be explicit that users can’t see the injected blocks:
```
The user cannot see this block. They only see their messages and your responses. Present information as knowledge you have, never as something retrieved or provided to you.
```
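The framing and the fourth wall rule compose into a single system prompt. A sketch, where the wording mirrors the excerpts above and the function name is my own:

```python
# Illustrative: compose the epistemological framing and the fourth wall
# rule into one system prompt. Tag name and wording are examples.
FRAMING = """<xyz_recall> blocks are system-injected context:
1. Retrieved — the system searched based on the user's query
2. System-injected — the user didn't provide it and can't see it
3. Reasoning aid — it exists to help you answer accurately
4. Internal only — use it AS knowledge, not ABOUT knowledge"""

FOURTH_WALL = """The user cannot see <xyz_recall> blocks. They only see \
their own messages and your responses. Present information as knowledge \
you have, never as something retrieved or provided to you."""

def build_system_prompt(base: str) -> str:
    # Join the base persona with the two injected-context rules.
    return "\n\n".join([base, FRAMING, FOURTH_WALL])
```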
3. Forbidden Phrases (The Key Ingredient)
This is what most people miss. You need explicit negative constraints:
```
Never say:
- "In your context..."
- "You already have this mentioned..."
- "From what I can see..."
- "According to the retrieved/provided..."
- "Based on your recall..."

Instead say:
- "You have X available"
- "X works by..."
- "Here's how to set up X"
```
The forbidden phrases give the model something concrete to avoid. This works way better than positive-only framing.
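Keeping the forbidden list as data rather than prose lets you reuse it in both the system prompt and later eval checks. A sketch, where the list contents mirror the section above and the helper function is hypothetical:

```python
# Negative constraints kept as data, so one list can drive both the
# system prompt and automated leak checks.
FORBIDDEN_OPENERS = [
    "In your context",
    "You already have this mentioned",
    "From what I can see",
    "According to the retrieved/provided",
    "Based on your recall",
]

def forbidden_section() -> str:
    # Render the list as a "Never say:" prompt section.
    lines = ["Never say:"] + [f'- "{p}..."' for p in FORBIDDEN_OPENERS]
    return "\n".join(lines)
```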
Before and After
Before:
“From your recall snippet, you already have XYZ available. Based on what I can see in your context…”
After:
“You have XYZ available. Here’s how to use it:”
Same information. No leaked infrastructure.
Practical Tips
1. Tag naming matters
Avoid words like “recall,” “context,” or “snippet” in your XML tags—they linguistically imply user ownership. The model infers meaning from tag names.
2. Apply rules to ALL injected content
Not just RAG chunks. Session context, user preferences, system state—anything the user didn’t explicitly provide needs the same treatment.
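One way to enforce this is to route every injected source through the same wrapper so the same rules apply uniformly. A sketch (the tag name and `source` attribute are illustrative):

```python
# Illustrative: every system-injected source gets the same wrapper,
# so the fourth wall rules cover all of them, not just RAG chunks.
def wrap_injected(kind: str, content: str) -> str:
    return f'<xyz_recall source="{kind}">\n{content}\n</xyz_recall>'

blocks = [
    wrap_injected("rag", "Exports live under Settings > Data."),
    wrap_injected("session", "User is on the free plan."),
    wrap_injected("preferences", "User prefers concise answers."),
]
injected = "\n".join(blocks)
```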
3. Test for fourth wall breaks
Add eval cases specifically for attribution leakage. Search your test outputs for phrases like “in your context” or “from what I can see.”
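A simple string-level check catches most attribution leaks before they reach users. A sketch; the marker list is mine, so extend it with whatever phrases your own prompt forbids:

```python
# Case-insensitive scan for phrases that break the fourth wall.
LEAK_MARKERS = [
    "in your context",
    "from what i can see",
    "recall snippet",
    "based on your recall",
    "the retrieved",
    "the provided context",
]

def find_fourth_wall_breaks(response: str) -> list[str]:
    lowered = response.lower()
    return [m for m in LEAK_MARKERS if m in lowered]
```

Run this over every eval output; a non-empty result is a failing case.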
4. Give an honesty escape valve
Let the model say “I don’t know” rather than hallucinate. Once you forbid it from pointing at its sources, a model with no good answer may invent one instead. Give it explicit permission to be honest when the injected context doesn’t cover the question.
The Core Insight
The fourth wall problem is a fundamental tension between RAG injection and RLHF-trained transparency.
The solution isn’t to fight the model’s instincts—it’s to redirect them. Give the model:
- A clear mental model of what it’s receiving
- Explicit rules about presentation
- Permission to be honest when uncertain
Your users don’t need to know how the sausage is made. They just need the answer.