We Built an AI That Learns From Its Own Mistakes — Every Night
The Problem With AI That Forgets
Every AI session starts cold. The corrections you made yesterday — the tone you asked it to change, the format it got wrong, the edge case it mishandled — are gone. You're back to square one.
For a consulting firm that relies on consistent AI behavior across dozens of clients and hundreds of sessions, this isn't a minor inconvenience. It's a fundamental reliability problem.
So we built something to fix it.
What We Built
The Support Forge AI Self-Improvement System is an automated pipeline that runs every night at 2 AM. It captures corrections from the day's sessions, proposes targeted improvements to our skill files, validates those improvements through 30-round testing, and deploys the ones that pass — all without human involvement.
Here's the six-phase pipeline:
P0 — Session Capture: A post-session hook fires automatically after each Claude Code session, capturing corrections and behavioral notes to correction-events.jsonl.
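The append step of a hook like this is simple enough to sketch. The archive filename comes from the article; the function name and event fields are illustrative assumptions:

```python
import json
import time
from pathlib import Path

ARCHIVE = Path("correction-events.jsonl")

def capture_correction(session_id: str, skill: str, note: str,
                       archive: Path = ARCHIVE) -> dict:
    """Append one correction event as a JSON line to the archive.

    Field names are assumptions for illustration; the timestamp is
    what later enables recency weighting in the analysis phase.
    """
    event = {
        "ts": time.time(),
        "session_id": session_id,
        "skill": skill,
        "note": note,
    }
    with archive.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

Append-only JSONL is a good fit here: each session writes one line, nothing is ever rewritten, and downstream phases can stream the file without loading it all at once.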
P1 — Pattern Analysis: The correction scanner reads the archive and classifies corrections by type, applying recency weighting (a correction from the last 30 days counts 4× as much as one from 90 days ago).
P2 — Proposal Generation: Using the Anthropic API (haiku model for efficiency), the system generates targeted surgical diffs — minimal changes to skill files that address the identified patterns.
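A sketch of how the P2 request might be assembled. The prompt wording and model id are assumptions, but the dict mirrors the shape of an Anthropic Messages API request body, so in the real pipeline it could be passed to `client.messages.create(**request)`:

```python
def build_proposal_request(skill_text: str, patterns: dict) -> dict:
    """Build a request body asking a small model for a surgical diff.

    Prompt wording and model id are illustrative assumptions; the key
    constraint is asking for a minimal diff, not a rewrite.
    """
    prompt = (
        "Given this skill file and these weighted correction patterns, "
        "propose a minimal unified diff that addresses the patterns "
        "without touching unrelated sections.\n\n"
        f"SKILL FILE:\n{skill_text}\n\n"
        f"PATTERNS (weighted):\n{patterns}"
    )
    return {
        "model": "claude-3-haiku-20240307",  # small model for cost efficiency
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Asking for a unified diff rather than a full replacement file is what keeps changes "surgical": the validation gate then only has to judge a small, reviewable delta.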
P3 — Nanochat Validation Gate: Each proposal runs through 30 API iterations, comparing before/after performance on a held-out test set (questions reserved specifically for validation, never used in training). Only proposals with a quality delta ≥ 5 advance.
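The gate logic itself can be sketched independently of the evaluator. Here `score_fn` stands in for whatever call scores one answer (an assumption; in the real pipeline this is the API evaluator), and the threshold of 5 comes from the article:

```python
import statistics

DELTA_THRESHOLD = 5.0  # quality delta required to advance (from the article)

def validation_gate(score_fn, baseline_skill: str, patched_skill: str,
                    holdout_questions: list, rounds: int = 30) -> tuple[bool, float]:
    """Compare patched vs. baseline skill over repeated rounds.

    score_fn(skill_text, question) -> float is an assumed interface.
    Averaging per-question deltas across rounds smooths out the
    evaluator's own run-to-run noise.
    """
    deltas = []
    for _ in range(rounds):
        for q in holdout_questions:
            deltas.append(score_fn(patched_skill, q) - score_fn(baseline_skill, q))
    delta = statistics.mean(deltas)
    return delta >= DELTA_THRESHOLD, delta
```

A negative delta, like the −48 case described below in the article, simply fails the gate; the proposal is discarded and the underlying corrections stay in the archive for the next cycle.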
P4 — Regression Testing: The skill-tester runs a full regression suite across all 49 skills and 264 agents, catching any regressions the proposal might introduce.
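The comparison at the heart of a regression suite like this reduces to a per-skill before/after diff. A minimal sketch, with the data shape assumed to be a mapping of skill name to score:

```python
def regression_check(before: dict, after: dict,
                     tolerance: float = 0.0) -> list:
    """Return the skills whose score dropped by more than `tolerance`.

    An empty list means the proposal introduced no regressions; any
    entries block deployment. The dict-of-scores shape is an assumption.
    """
    return sorted(
        skill for skill, old in before.items()
        if after.get(skill, 0.0) < old - tolerance
    )
```

Running this across all 49 skills after each proposal is cheap compared to the validation gate, and it catches the failure mode the gate cannot: a change that helps its target skill while quietly degrading a neighboring one.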
P5 — Deploy and Report: Passing improvements are committed to git with attribution tracking. A confidence score and nightly report are generated by 2:15 AM.
Four Months of Data
We launched the archive in December 2025 and the formal pipeline in March 2026. Here's what the numbers show:
- 4,020 sessions captured in the archive
- 40 corrections documented and processed
- 8 promotions deployed (improvements that passed all gates)
- 4 discards — proposals that scored below the quality delta threshold, protecting against regressions
- 98.7% average skill score across all 49 measured skills
- 83/100 system confidence — our composite metric that weights skill scores, volatility, and correction history
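The article names the inputs to the confidence metric but not the formula, so the blend below is entirely an assumed illustration: average skill score, a volatility penalty, and the fraction of corrections resolved into promotions, combined with made-up weights:

```python
def system_confidence(skill_scores: list, volatility: float,
                      open_corrections: int, promotions: int,
                      w_scores: float = 0.7, w_vol: float = 0.2,
                      w_hist: float = 0.1) -> float:
    """Blend skill scores, volatility, and correction history into 0-100.

    Weights and terms are illustrative assumptions; only the three
    input categories come from the article.
    """
    avg = sum(skill_scores) / len(skill_scores)       # 0-100 average skill score
    vol_term = max(0.0, 100.0 - 100.0 * volatility)   # penalize score variance
    resolved = promotions / max(1, promotions + open_corrections)
    return round(w_scores * avg + w_vol * vol_term + w_hist * 100.0 * resolved, 1)
```

The useful property of any blend like this is that it moves for the right reasons: it rises as promotions land and skill scores climb, and it falls when volatile skills or a backlog of unresolved corrections accumulate.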
The delta gate has been critical. In one case, a proposed update to our deployment skill scored −48. Without the gate, that regression would have shipped. Instead, the correction weight was preserved and will inform the next cycle.
Two Skills Below 100%
The system currently flags two skills as volatile: database-migrations (67%) and incident-response-cyber (67%). These are skills that deal with high-variance, high-stakes scenarios where the right answer genuinely depends on context. The volatility isn't a bug — it's the system correctly identifying where human judgment should override automation.
What It Doesn't Do
This system doesn't retrain the base LLM. It doesn't guarantee perfect recall. It doesn't run in real time (session capture is immediate, but analysis, validation, and deployment happen only in the nightly batch). And it doesn't promote any change without a measurable quality improvement.
What it does do is systematically compound improvement. Each correction feeds future sessions. 40 corrections became 8 promotions, which measurably shifted average skill scores and system confidence.
Why This Matters for AI Consulting
If you're deploying AI for clients, consistency is the product. A client should get the same quality of output from your AI system on day 400 as they got on day 1 — actually better, because the system learned from 400 days of real usage.
The architecture we built for Support Forge is model-agnostic. Any LLM with an API can be plugged in as the evaluator. Any skill file format can be validated. The pipeline concept is portable to any AI consulting practice that needs behavioral consistency at scale.
The Bigger Picture
We're at B3 — the fourth benchmark phase in our system's evolution:
- B0 (Jun 2025): Native session adaptation only. No persistence.
- B1 (Dec 2025): Archive online. Cross-session recall enabled for the first time.
- B2 (Mar 2026): Overriding engineering principles locked in CLAUDE.md.
- B3 (Mar 2026): Full automated nightly pipeline. 8 promotions in the first 3 days.
The next phase is human-in-the-loop review for volatile skills and cross-client behavioral consistency tracking.
---
If you're thinking about how to build reliable AI behavior at scale — whether for your own business or your clients — book a discovery call. This is the architecture work we do every day.
Related Articles
Why Your AI Needs a Self-Improvement System (And What One Actually Looks Like)
AI tools are only as good as what they remember. If your system forgets every correction the moment a session ends, you're not building anything — you're just repeating yourself. Here's why self-improvement matters, what it can and can't do, and how we built one that runs every night.
What Is Claude Code and Why Should Your Business Care?
You've probably heard of ChatGPT. Claude Code is different — and for business owners who aren't developers, it's worth understanding what it actually does and how it gets deployed to save time and money.
5 AI Automations Every Small Business Needs
Discover the essential AI automations that can save your small business hours every week. From customer service to invoicing, these tools are game-changers.