AI Systems · Self-Improvement · Claude Code · AI Architecture · Automation

We Built an AI That Learns From Its Own Mistakes — Every Night

March 19, 2026 · 8 min read

The Problem With AI That Forgets

Every AI session starts cold. The corrections you made yesterday — the tone you asked it to change, the format it got wrong, the edge case it mishandled — are gone. You're back to square one.

For a consulting firm that relies on consistent AI behavior across dozens of clients and hundreds of sessions, this isn't a minor inconvenience. It's a fundamental reliability problem.

So we built something to fix it.

What We Built

The Support Forge AI Self-Improvement System is an automated pipeline that runs every night at 2 AM. It captures corrections from the day's sessions, proposes targeted improvements to our skill files, validates those improvements through 30-round testing, and deploys the ones that pass — all without human involvement.

Here's the six-phase pipeline:

P0 — Session Capture: A post-session hook fires automatically after each Claude Code session, capturing corrections and behavioral notes to correction-events.jsonl.
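A minimal sketch of what such a capture hook might append. The event fields here are assumptions for illustration; the post only names the correction-events.jsonl file itself:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def capture_correction(archive: Path, session_id: str,
                       correction: str, note: str = "") -> dict:
    """Append one correction event as a JSON line (hypothetical schema)."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "correction": correction,
        "note": note,
    }
    with archive.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event
```

JSON Lines keeps the archive append-only, so the hook never has to parse or rewrite existing events.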

P1 — Pattern Analysis: The correction scanner reads the archive and classifies corrections by type, applying recency weighting (corrections from the last 30 days count 4× more than ones from 90 days ago).
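The recency weighting could look like the sketch below. The post fixes only the 4:1 ratio between recent corrections and 90-day-old ones; the linear decay between those two points is an assumption:

```python
def recency_weight(age_days: float) -> float:
    """Weight a correction by age: 4x inside 30 days, tapering to 1x at 90 days.
    The linear taper is an illustrative assumption."""
    if age_days <= 30:
        return 4.0
    if age_days >= 90:
        return 1.0
    # linear interpolation between (30 days, 4.0) and (90 days, 1.0)
    return 4.0 - 3.0 * (age_days - 30) / 60.0
```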

P2 — Proposal Generation: Using the Anthropic API (haiku model for efficiency), the system generates targeted surgical diffs — minimal changes to skill files that address the identified patterns.

P3 — Nanochat Validation Gate: Each proposal runs through 30 API iterations, comparing before/after performance on a held-out test set (questions reserved specifically for validation, never used in training). Only proposals with a quality delta ≥ 5 advance.
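The gate itself reduces to a before/after comparison over the 30 iterations. A sketch, assuming scores on a 0–100 scale (the scoring function itself is not shown in the post):

```python
from statistics import mean

def passes_gate(before: list[float], after: list[float],
                min_delta: float = 5.0) -> tuple[float, bool]:
    """Compare mean held-out scores across the before/after iterations;
    only proposals with delta >= min_delta advance to regression testing."""
    delta = mean(after) - mean(before)
    return delta, delta >= min_delta
```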

P4 — Regression Testing: The skill-tester runs a full regression suite across all 49 skills and 264 agents, catching any regressions the proposal might introduce.
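At its core, a regression check of this shape compares per-skill scores against a baseline. The skill names and tolerance below are illustrative; the actual skill-tester is not shown in the post:

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return each skill whose score dropped below baseline by more than
    tolerance, mapped to its (negative) score delta."""
    return {
        skill: current.get(skill, 0.0) - baseline[skill]
        for skill in baseline
        if current.get(skill, 0.0) < baseline[skill] - tolerance
    }
```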

P5 — Deploy and Report: Passing improvements are committed to git with attribution tracking. A confidence score and nightly report are generated by 2:15 AM.

Four Months of Data

We launched the archive in December 2025 and the formal pipeline in March 2026. Here's what the numbers show:

  • 4,020 sessions captured in the archive
  • 40 corrections documented and processed
  • 8 promotions deployed (improvements that passed all gates)
  • 4 discards — proposals that scored below the quality delta threshold, protecting against regressions
  • 98.7% average skill score across all 49 measured skills
  • 83/100 system confidence — our composite metric that weights skill scores, volatility, and correction history
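The composite confidence metric could be computed along these lines. The post states only the three inputs; the specific weights and the higher-is-worse treatment of volatility and correction history are assumptions:

```python
def system_confidence(avg_skill: float, volatility: float, correction_rate: float,
                      weights: tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
    """Blend average skill score (higher is better) with volatility and
    correction history (lower is better) into a 0-100 composite.
    Weights are illustrative assumptions."""
    w_skill, w_vol, w_corr = weights
    score = (w_skill * avg_skill
             + w_vol * (100.0 - volatility)
             + w_corr * (100.0 - correction_rate))
    return round(score, 1)
```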

The delta gate has been critical. In one case, a proposed update to our deployment skill scored −48. Without the gate, that regression would have shipped. Instead, the correction weight was preserved and will inform the next cycle.

Two Skills Below 100%

The system currently flags two skills as volatile: database-migrations (67%) and incident-response-cyber (67%). These are skills that deal with high-variance, high-stakes scenarios where the right answer genuinely depends on context. The volatility isn't a bug — it's the system correctly identifying where human judgment should override automation.

What It Doesn't Do

This system doesn't retrain the base LLM. It doesn't guarantee perfect recall. It doesn't run in real time (nightly batch only, with immediate session capture as the exception). And it doesn't promote any change without a measurable quality improvement.

What it does do is systematically compound improvement. Each correction feeds future sessions: 40 corrections became 8 promotions, which measurably shifted average skill scores and system confidence.

Why This Matters for AI Consulting

If you're deploying AI for clients, consistency is the product. A client should get the same quality of output from your AI system on day 400 as they got on day 1 — actually better, because the system learned from 400 days of real usage.

The architecture we built for Support Forge is model-agnostic. Any LLM with an API can be plugged in as the evaluator. Any skill file format can be validated. The pipeline concept is portable to any AI consulting practice that needs behavioral consistency at scale.
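One way to express that model-agnosticism is a narrow evaluator interface that any LLM backend can satisfy. This is a sketch of the concept, not Support Forge's actual code; the method name and held-out triple format are assumptions:

```python
from typing import Protocol

class Evaluator(Protocol):
    """Any LLM backend that can score an answer to a held-out question."""
    def score(self, question: str, answer: str) -> float: ...

def quality_delta(evaluator: Evaluator,
                  held_out: list[tuple[str, str, str]]) -> float:
    """Average score(after) - score(before) over held-out
    (question, answer_before, answer_after) triples."""
    deltas = [evaluator.score(q, after) - evaluator.score(q, before)
              for q, before, after in held_out]
    return sum(deltas) / len(deltas)
```

Swapping models then means swapping the object behind `Evaluator`; the gate logic downstream never changes.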

The Bigger Picture

We're at B3 — the fourth benchmark phase in our system's evolution:

  • B0 (Jun 2025): Native session adaptation only. No persistence.
  • B1 (Dec 2025): Archive online. Cross-session recall enabled for the first time.
  • B2 (Mar 2026): Overriding engineering principles locked in CLAUDE.md.
  • B3 (Mar 2026): Fully automated nightly pipeline. 8 promotions in the first 3 days.

The next phase is human-in-the-loop review for volatile skills and cross-client behavioral consistency tracking.

---

If you're thinking about how to build reliable AI behavior at scale — whether for your own business or your clients — book a discovery call. This is the architecture work we do every day.

Perry Bailes
Founder, Support Forge

