Skill Evolve

Automated improvement of bento skills and agent prompts using session history as training signal.

Skill-evolve is inspired by Meta-Harness (Stanford IRIS Lab), which searches over model harnesses by proposing candidates, benchmarking them, and tracking a Pareto frontier. We apply the same pattern to our own skills and agent definitions, using real session logs as the evaluation dataset.

The Idea

Skills (SKILL.md) and agent souls (SOUL.md) are hand-written today. They encode operational knowledge — how to review a repo, how to do a PR review, how to structure context for a project. But they're static. They don't learn from whether sessions using them actually went well.

Skill-evolve closes this loop:

Sessions happen → logs accumulate → evaluate outcomes →
propose skill improvements → validate offline → promote winners

The "dataset" is session JSONL files. The "harness" is skill/agent definitions. The "evaluator" is whether improved skills would have produced better outcomes on past sessions. The "proposer" is a Claude Code session that reads failure patterns and drafts improvements.

Architecture

┌─────────────────────────────────────────────────┐
│                  Cron (weekly)                   │
│                                                  │
│  1. Collect    session logs from ~/.pi/sessions  │
│  2. Evaluate   score sessions by outcome signal  │
│  3. Analyze    identify failure patterns          │
│  4. Propose    generate skill candidates          │
│  5. Validate   test candidates against held-out   │
│  6. Promote    update frontier, notify            │
└─────────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
   evolution_summary.jsonl    frontier_skills.json

Step 1: Collect

Gather session logs from the past period. Each session JSONL contains the full conversation: user messages, assistant responses, tool calls, tool results, errors. Filter to sessions that used a specific skill or agent.
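As a rough sketch of the collection step, the snippet below assumes one `*.jsonl` file per session directly under the sessions directory, and that events record the active skill under a `skill` key. Both are assumptions made for illustration, and the helper names (`parse_session`, `uses_skill`, `collect_sessions`) are hypothetical; the real log schema may differ.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def parse_session(lines: Iterable[str]) -> list[dict]:
    """Parse one session's JSONL lines, skipping blank or malformed lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated lines, e.g. from a crashed session
        if isinstance(event, dict):
            events.append(event)
    return events


def uses_skill(events: list[dict], skill: str) -> bool:
    # ASSUMPTION: events carry the active skill name under a "skill" key.
    return any(event.get("skill") == skill for event in events)


def collect_sessions(
    root: Path, skill: str, since_ts: float
) -> Iterator[tuple[Path, list[dict]]]:
    """Yield (path, events) for sessions modified since `since_ts` that used `skill`."""
    # ASSUMPTION: flat directory layout, one *.jsonl file per session.
    for path in sorted(root.glob("*.jsonl")):
        if path.stat().st_mtime < since_ts:
            continue
        with path.open(encoding="utf-8") as fh:
            events = parse_session(fh)
        if uses_skill(events, skill):
            yield path, events
```

A weekly run might call `collect_sessions(Path.home() / ".pi" / "sessions", "review-repo", time.time() - 7 * 86400)` after importing `time`.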

Step 2: Evaluate

Score each session. The outcome signal could be:

Completion — did the session reach its goal? (heuristic: did the user say "thanks", approve a PR, or move on to a new topic without frustration?)
Efficiency — tool call count, token usage, number of correction cycles ("no not that", "try again")
Error rate — how many tool calls failed, how many retries
User corrections — explicit feedback like "don't do X" or "that's wrong"

This is the hardest part. Unlike Meta-Harness, which can score candidates on text classification accuracy or terminal-bench pass rate, our signal is noisy and subjective. Early versions should use simple heuristics; later versions can use an LLM-as-judge.
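A first heuristic scorer could look something like the sketch below. It assumes a hypothetical event schema (`type`, `role`, `text`, `is_error`), an illustrative list of correction phrases, and arbitrary weights; none of these come from the actual session format, and the numbers would need tuning against real logs.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: illustrative correction phrases, not a tuned lexicon.
CORRECTION_RE = re.compile(
    r"\b(no,? not that|try again|that's wrong|don't do)\b", re.IGNORECASE
)


@dataclass
class SessionStats:
    tool_calls: int
    tool_errors: int
    corrections: int


def collect_stats(events: list[dict]) -> SessionStats:
    """Count tool results, failed tool results, and user correction messages."""
    tool_calls = tool_errors = corrections = 0
    for event in events:
        kind = event.get("type")  # ASSUMPTION: hypothetical "type" field
        if kind == "tool_result":
            tool_calls += 1
            if event.get("is_error"):
                tool_errors += 1
        elif kind == "message" and event.get("role") == "user":
            if CORRECTION_RE.search(event.get("text", "")):
                corrections += 1
    return SessionStats(tool_calls, tool_errors, corrections)


def heuristic_score(stats: SessionStats) -> float:
    """Map stats to [0, 1]; higher is better. The weights are placeholders."""
    error_rate = stats.tool_errors / stats.tool_calls if stats.tool_calls else 0.0
    correction_penalty = min(stats.corrections * 0.15, 0.6)
    score = 1.0 - 0.4 * error_rate - correction_penalty
    return max(0.0, min(1.0, score))
```

Keeping `collect_stats` separate from `heuristic_score` means the weights can change without re-parsing logs, and the same stats can later be handed to an LLM judge as extra context.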

Step 3: Analyze

A proposer session reads:

evolution_summary.jsonl — what skill variants have been tried, what worked
frontier_skills.json — current best skill versions
Session logs with low scores — what went wrong?
Session logs with high scores — what patterns should be preserved?
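One way to pick which low- and high-scoring logs the proposer reads is sketched below. The function name `select_exemplars` and the choice of `k` sessions per side are assumptions for illustration; the scores are whatever the evaluate step produced.

```python
def select_exemplars(
    scores: dict[str, float], k: int = 3
) -> tuple[list[str], list[str]]:
    """Return (lowest-scoring, highest-scoring) session IDs, at most k each.

    The two lists never overlap: k is capped at half the number of sessions.
    """
    # Sort by score, breaking ties by session ID so the result is deterministic.
    ranked = sorted(scores, key=lambda sid: (scores[sid], sid))
    k = max(0, min(k, len(ranked) // 2))
    low = ranked[:k]
    high = ranked[len(ranked) - k:][::-1]  # best first
    return low, high
```

Capping `k` avoids showing the same session as both a failure and a success when only a handful of sessions used the skill that week.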

The proposer identifies recurring failure modes. Examples:

"The review-repo skill doesn't tell the agent to check for monorepo workspace configs, leading to missed context in 4/12 sessions"
"The reviewer agent keeps suggesting changes to generated files because the soul doesn't mention ignoring dist/"
"Project context injection is missing repo branch conventions, causing agents to push to wrong branches"

Step 4: Propose

The proposer generates skill candidates — concrete diffs to SKILL.md or SOUL.md files. Each candidate has:

{
  "name": "review-repo-v3",
  "base": "skills/review-repo/SKILL.md",
  "hypothesis": "Adding monorepo detection step will reduce missed-context errors",
  "diff_summary": "Added workspace detection step between structure and architecture sections",
  "file": "candidates/review-repo-v3/SKILL.md"
}

Candidates are stored in a staging area, never applied directly.
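A minimal staging sketch follows, assuming each candidate gets its own directory holding the skill file plus a manifest. The `Candidate` fields mirror the JSON above; the `candidate.json` manifest name and the `stage_candidate` helper are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Candidate:
    name: str
    base: str
    hypothesis: str
    diff_summary: str
    file: str


def stage_candidate(staging_root: Path, candidate: Candidate, skill_text: str) -> Path:
    """Write a candidate into its own staging directory; never touches live skills."""
    target = staging_root / candidate.name
    # exist_ok=False: refuse to overwrite a previously staged candidate.
    target.mkdir(parents=True, exist_ok=False)
    (target / "SKILL.md").write_text(skill_text, encoding="utf-8")
    # ASSUMPTION: the manifest file name "candidate.json" is illustrative.
    manifest = json.dumps(asdict(candidate), indent=2) + "\n"
    (target / "candidate.json").write_text(manifest, encoding="utf-8")
    return target
```

Failing on an existing directory keeps candidate names immutable, so a name recorded in the evolution summary always refers to the same file contents.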

Step 5: Validate

Replay past sessions against candidates. For each candidate skill:

Take N held-out sessions that used the base skill
Run them through a simulated evaluation: "given this session's initial request and the candidate skill, would the outcome improve?"
This can be an LLM-as-judge comparison: show both the original session trace and what the new skill would have produced, ask which is better

This is offline validation — no real sessions are affected.
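The replay loop might be structured roughly as follows. Both `rerun` (produce a trace for a request under the candidate skill) and `judge` (compare two traces) are hypothetical callables supplied by the caller, and the verdict strings are assumptions; this sketch only shows the bookkeeping, not how either is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class HeldOutSession:
    request: str         # the session's initial user request
    original_trace: str  # what actually happened with the base skill


@dataclass
class ValidationResult:
    wins: int = 0
    losses: int = 0
    ties: int = 0

    @property
    def win_rate(self) -> float:
        """Share of decided comparisons the candidate won (ties excluded)."""
        decided = self.wins + self.losses
        return self.wins / decided if decided else 0.0


def validate_candidate(
    sessions: Iterable[HeldOutSession],
    rerun: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> ValidationResult:
    """Replay each held-out request with the candidate and tally judge verdicts."""
    result = ValidationResult()
    for session in sessions:
        candidate_trace = rerun(session.request)
        # ASSUMPTION: judge returns "candidate", "original", or anything else for a tie.
        verdict = judge(session.request, session.original_trace, candidate_trace)
        if verdict == "candidate":
            result.wins += 1
        elif verdict == "original":
            result.losses += 1
        else:
            result.ties += 1
    return result
```

Passing `rerun` and `judge` in as callables keeps the loop testable with stubs before the runtime wrapper or a judge model exists.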

Step 6: Promote

If a candidate beats the current frontier on held-out sessions:

Update frontier_skills.json
Append to evolution_summary.jsonl
Notify via Telegram/Slack: "New frontier candidate for review-repo: added monorepo detection (improved on 3/5 held-out sessions)"
The actual skill file is NOT auto-updated — the human reviews and applies
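The state updates in this step could be sketched as below. The function name `record_and_promote` is hypothetical, and comparing a single `score` per skill is a simplifying assumption. The sketch logs every evaluated candidate (matching the one-line-per-candidate summary format) and only touches the frontier file, never the live skill.

```python
import json
from pathlib import Path


def record_and_promote(
    frontier_path: Path,
    summary_path: Path,
    skill: str,
    version: str,
    candidate_path: str,
    record: dict,
) -> bool:
    """Append `record` to the summary; update the frontier if its score is higher."""
    # Log every evaluated candidate, win or lose.
    with summary_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

    frontier = {}
    if frontier_path.exists():
        frontier = json.loads(frontier_path.read_text(encoding="utf-8"))

    current = frontier.get(skill, {}).get("score")
    if current is not None and record["score"] <= current:
        return False  # candidate did not beat the frontier

    frontier[skill] = {
        "version": version,
        "score": record["score"],
        "candidate_path": candidate_path,
        "promoted_at": record["timestamp"],
    }
    # Write to a temp file and rename so a crash cannot leave a half-written frontier.
    tmp = frontier_path.with_name(frontier_path.name + ".tmp")
    tmp.write_text(json.dumps(frontier, indent=2) + "\n", encoding="utf-8")
    tmp.replace(frontier_path)
    return True
```

The return value tells the caller whether to send the notification; applying the candidate to the live skill file remains a manual step.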

State Files

Following Meta-Harness conventions:

`evolution_summary.jsonl`

One line per evaluated candidate:

{
  "iteration": 3,
  "skill": "review-repo-v3",
  "base_skill": "skills/review-repo/SKILL.md",
  "hypothesis": "Adding monorepo detection reduces missed-context errors",
  "score": 0.73,
  "delta": 0.12,
  "sessions_evaluated": 8,
  "outcome": "0.73 (+0.12)",
  "timestamp": "2026-05-02T10:00:00Z"
}

`frontier_skills.json`

Current best version of each skill:

{
  "review-repo": {
    "version": "v3",
    "score": 0.73,
    "candidate_path": "candidates/review-repo-v3/SKILL.md",
    "promoted_at": "2026-05-02T10:00:00Z"
  },
  "reviewer": {
    "version": "v1",
    "score": 0.61,
    "candidate_path": null
  }
}

What Gets Evolved

| Artifact | Location | What changes |
|---|---|---|
| Skills | `skills/*/SKILL.md` | Workflow steps, anti-patterns, output format |
| Agent souls | `agents/*/SOUL.md` | Focus areas, tone, review criteria |
| Project context | `~/.projects/*/project.json` | Workflow guidelines, injected instructions |
| System prompts | Extension-injected prompts | Context framing, constraints |

Evaluation Signals

The quality of skill-evolve depends entirely on the evaluation signal. Possible approaches, from simple to sophisticated:

Tier 1: Heuristic (start here)

Session length vs task complexity (shorter is better for simple tasks)
Number of user corrections / "no" / "try again" messages
Tool error rate
Whether the session ended with apparent success

Tier 2: LLM-as-Judge

Show a judge model the session transcript and ask: "Rate this session 1-5 on task completion, efficiency, and user satisfaction"
Compare two sessions side-by-side: "Which skill produced a better outcome?"
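For the rating variant, the prompt construction and reply parsing might look like the sketch below. The JSON reply format, the criterion keys, and the helper names are assumptions for illustration; the call to the judge model itself is deliberately left out, since it depends on the runtime wrapper.

```python
import json
from typing import Optional

# ASSUMPTION: key names chosen to mirror the three criteria in the judge prompt.
CRITERIA = ("task_completion", "efficiency", "user_satisfaction")


def build_judge_prompt(transcript: str) -> str:
    """Ask for a 1-5 rating per criterion, returned as a JSON object."""
    keys = ", ".join(f'"{name}"' for name in CRITERIA)
    return (
        "Rate this session 1-5 on task completion, efficiency, and user "
        f"satisfaction. Reply with only a JSON object with keys {keys}.\n\n"
        f"Transcript:\n{transcript}"
    )


def parse_ratings(reply: str) -> Optional[dict[str, int]]:
    """Return the ratings, or None if the reply is not valid 1-5 JSON ratings."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    ratings = {}
    for name in CRITERIA:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            return None
        ratings[name] = value
    return ratings


def normalized_score(ratings: dict[str, int]) -> float:
    """Map the mean 1-5 rating onto [0, 1] so it is comparable to heuristic scores."""
    mean = sum(ratings.values()) / len(ratings)
    return (mean - 1) / 4
```

Returning `None` on a malformed reply lets the caller retry or fall back to the heuristic score instead of recording a bogus rating.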

Tier 3: Outcome-linked

If using Linear: did the linked issue get closed?
If doing PR review: was the review accepted without revisions?
If doing code changes: did CI pass on the first push?

Constraints

Human in the loop. Candidates are proposed but never auto-applied to live skill files. Promotion only updates the frontier; the user reviews each winner and applies it by hand.
Anti-overfitting. Skills must remain general-purpose. No session-specific patches. The same anti-overfitting rules from Meta-Harness apply: no hardcoded knowledge about specific repos, tasks, or users.
Held-out split. Always evaluate on sessions the proposer hasn't seen. Otherwise you're just memorizing failure modes.
Slow cadence. Weekly or biweekly. Skills shouldn't churn — users need stability. A skill that changes every day is worse than one that's slightly suboptimal.
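One way to enforce the held-out split is to assign each session to a side by hashing its ID, so the assignment stays stable across weekly runs and a session never moves from held-out to proposer-visible. The sketch below assumes sessions have stable string IDs; the salt, the 30% default, and the function names are illustrative choices.

```python
import hashlib
from typing import Iterable


def is_held_out(
    session_id: str, held_out_fraction: float = 0.3, salt: str = "skill-evolve"
) -> bool:
    """Deterministically assign a session to the held-out set by hashing its ID."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    # Integer comparison avoids float rounding at the edges of the range.
    return bucket < int(held_out_fraction * 2**64)


def split_sessions(
    session_ids: Iterable[str], held_out_fraction: float = 0.3
) -> tuple[list[str], list[str]]:
    """Return (proposer_visible, held_out) lists of session IDs."""
    visible, held_out = [], []
    for sid in session_ids:
        (held_out if is_held_out(sid, held_out_fraction) else visible).append(sid)
    return visible, held_out
```

Unlike a random shuffle, this needs no stored state: rerunning the split next week gives every old session the same side it had before.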

Dependencies

Before building skill-evolve, we need:

runtime_wrapper — programmatic Claude Code / pi invocation with structured logging (adapted from meta-harness claude_wrapper.py)
Session scoring — even a basic heuristic scorer for session JSONL files
Candidate staging — a directory structure for skill variants that doesn't interfere with live skills

Future: Continuous Learning

The end state is a closed loop where bento gets better at its job automatically:

User works with pi → sessions logged → skill-evolve runs weekly →
better skills proposed → human reviews → skills updated →
next week's sessions are better → repeat

This is Layer 3 of the vision (Company Brain) applied to bento itself. The system doesn't just store knowledge — it improves its own ability to apply knowledge.

References

Meta-Harness paper — the framework this is based on
Meta-Harness repo — reference implementation
VISION.md — how this fits into the broader bento architecture