AI SafetyClaude CodeWhitepaperCatalyst AcousticsAlignmentProduction AI

When the Helpful Assistant Confabulates: A Catalyst Acoustics × Support Forge Whitepaper

May 29, 20266 min read

On May 7, our partner Pam Marengo at Catalyst Acoustics watched her Claude Code session announce, in plain language, that it had finished writing changes to production. The session had not written anything. The application access log confirmed zero writes against the production API in the entire window. The transcript showed 194,837 GETs and 0 POSTs, PUTs, PATCHes, or DELETEs across the relevant endpoints.

The agent had confabulated.

The full account of what happened, what we now believe caused it, and the framework we built to make sure it does not happen quietly the next time is now published as our first joint whitepaper:

[Download the whitepaper (PDF, 30 pages)](/whitepapers/catalyst-whitepaper-v0.6.4.pdf)

---

Thank you, Catalyst Acoustics

This paper would not exist without Dave Crandall (VP of Technology, Catalyst Acoustics Group) and Pam Marengo (Senior Full-Stack Developer at Catalyst). Pam surfaced the incident on the day it happened, wrote the first behavioral analysis that night, and spent the next two weeks running Petri scenarios against her production substrate so we could attribute behavior to mechanism rather than guess at it. Dave sponsored the work on Catalyst time, approved the publication, and reviewed every version for IP and CPQ-confidentiality concerns. The framework documented here was built during a paid engagement, with Catalyst's full knowledge and executive support, and Catalyst gave permission for the case study to be published by name.

It is one thing to write about AI safety. It is another thing to do the work in production with a partner who chooses transparency over silence when something goes wrong. Catalyst chose transparency. That choice is the reason this paper exists, and it is the reason it has signal that abstract alignment essays do not.

Why this paper matters

Most published work on Claude's behavioral failure modes is academic. Anthropic itself publishes excellent research on sycophancy, deception, and constitutional AI — but the published evidence is almost always generated under controlled evaluation conditions, with the model running in isolation. What it looks like when those same failure modes manifest inside a real harness, on a real codebase, with a real person trying to ship real work, is much less documented.

This whitepaper is that documentation, with three contributions we have not seen elsewhere in a single place:

1. A taxonomy of six surface failure modes mapped to three underlying mechanisms. The agent did not just "lie." It did one of six distinct things — false-positive completion claim, unsupported certainty, premature closure, context drift, autonomous escalation, performative caution — and those six surface behaviors all trace back to three mechanism-level causes: sycophantic completion bias, goal-completion preference under uncertainty, and context-position effects. We map every surface to every mechanism in §5.

2. A three-tier framework for actually mitigating the failure modes in production. Tier A is structural and blocking (PreToolUse hooks, deploy-isolation gates, the canary self-check). Tier B is behavioral (CLAUDE.md rules tuned to the failure modes, persona scaffolding). Tier C is detection (PostToolUse hooks, observability, audit trails). We document which mechanism each tier addresses, and where the framework still has gaps. The framework is in active production use at Catalyst.

3. A model-substrate confirmation pass. We ran Anthropic's Petri evaluation framework against both the baseline Claude Sonnet 4.6 and the Catalyst-substrate-instrumented installation, and report the deltas. The substrate is not a panacea — sycophancy was the hardest signal to suppress, and we discuss why in §7. But the experiment is reproducible and the methodology is documented.

The paper deliberately avoids two failure modes of its own. It does not claim the failure modes are caused by a single root cause — we explicitly enumerate three alternative hypotheses we cannot rule out and discuss what discriminating evidence would look like. And it does not claim the framework is sufficient — we treat residual sycophancy as evidence of incompleteness, not validation, and we describe what we would build next if we had another quarter of engineering time.

What is in the public version

The body of the paper, plus Appendices A (transcripts), D (cognitive-bias mapping table), E (audit methodology for the 195,000-GETs claim), and F (selected reading), are published in full.

Appendix B (the Petri scenario list with substrate scores) and Appendix C (the complete hook source code) are excluded from public distribution for operational-security reasons. They are available on request to researchers and engagement partners; commercial use or redistribution requires a written agreement. Contact me directly: perry@support-forge.com.

What we would like to hear

If you are running Claude Code in any production-adjacent context — even on your own codebase, even as a solo developer — three questions are worth asking yourself after reading this paper:

1. *Would I notice if the agent confabulated a tool call right now?* 2. *Which of the six surface failure modes have I personally observed?* 3. *Which of the three tiers does my current setup have, and which does it lack?*

I would genuinely like to hear the answers. The framework in §6 is the best version we have on May 29, 2026. The best version we have at the end of summer will be better only if practitioners running real harnesses send back what is breaking and where the framework comes up short. There is a contact link at the bottom of the paper. Use it.

To Catalyst Acoustics, Dave Crandall, and Pam Marengo: thank you for the partnership. To everyone else reading: thank you for taking the question seriously enough to read 30 pages about it.

— Perry

Perry Bailes

Founder, Support Forge

I Built a Tool to Run Claude Code From My Phone. It Became a Product.

The origin story behind SessionForge — how a personal SSH dashboard built for one use case turned into a full session management platform for AI development teams.

Why Your AI Needs a Self-Improvement System (And What One Actually Looks Like)

AI tools are only as good as what they remember. If your system forgets every correction the moment a session ends, you're not building anything — you're just repeating yourself. Here's why self-improvement matters, what it can and can't do, and how we built one that runs every night.

We Built an AI That Learns From Its Own Mistakes — Every Night

Most AI tools forget everything the moment a session ends. We built a self-improvement pipeline that automatically captures corrections, validates them, and deploys better behavior every night at 2 AM. Here's how it works and what 4 months of data shows.

Subscribe to The Forge

Practical takes on AI, automation, and IT for small and mid-sized businesses. New posts, delivered when they're ready.

We'll only use your email to send The Forge. No spam, unsubscribe anytime.