Skill Evaluation & Measurement Guide

How to evaluate, measure, and improve Claude Code skills with data. This is the detailed reference for Level 5: Measurement in the skill maturity framework.

`evals/evals.json` Format

{
  "skill_name": "create-pr",
  "version": "1.0.0",
  "created": "2025-01-15",
  "last_validated": null,
  "evals": [
    {
      "id": 1,
      "prompt": "Create a PR for the openregister feature branch",
      "expected_output": "PR targets development branch, has quality checks, proper format",
      "files": [],
      "expectations": [
        "targets development branch (not main)",
        "runs composer check:strict",
        "includes ## Summary and ## Test plan sections"
      ]
    }
  ],
  "trigger_tests": {
    "should_trigger": [
      "Create a pull request for the feature branch",
      "Open a PR from development to main",
      "Make a PR for my changes",
      "Submit this branch for review via PR",
      "Create a GitHub pull request",
      "PR this to development",
      "Open pull request for openregister branch",
      "Make a pull request for my new feature",
      "Create PR targeting the main branch",
      "Submit a pull request with these changes"
    ],
    "should_not_trigger": [
      "Can you review this code?",
      "What is the difference between git merge and rebase?",
      "How do I resolve a merge conflict?",
      "Show me the git log",
      "Commit my changes",
      "Push to the remote branch",
      "What branches are available?",
      "Help me write a commit message",
      "Show the diff for my changes",
      "Explain what a pull request is"
    ]
  }
}

After running evals, update last_validated with the run date to unlock L5 green circle status.

Using the Anthropic Skill Creator

The Anthropic Skill Creator automates running, grading, and improving evals as a Claude Code skill.

Installation

The Skill Creator lives at .claude/skills/skill-creator/ in the hydra repo. It's a vendored copy of anthropics/skills/skills/skill-creator/ with local modifications: the eval workspace lives inside each skill rather than as a sibling folder, and the long-form eval-runner protocol is extracted from SKILL.md into references/eval-runner.md to keep SKILL.md lean.

Keeping it up to date

Run bash .claude/skills/skill-creator/update.sh for a dry-run preview, then --apply to write changes. The script delegates to update_from_upstream.py, which performs a per-file 3-way merge between three trees:

base — upstream snapshot at the pin recorded in .upstream-version
theirs — upstream HEAD
ours — your current local files

For each file path it picks the right outcome (add, modify, delete, conflict, keep, unchanged) using git merge-file, so local edits, local additions (e.g. references/eval-runner.md), and locally-deleted files are all preserved without a separate patch file. Conflicts are listed in the dry-run report; running --apply alone leaves conflicting files untouched, while --apply --force-conflicts writes standard <<<<<<< markers into them for editor-driven resolution.

The previous local-mods.patch mechanism is gone — the merge replaces it. If you find that file in an old checkout, delete it. The new flow needs no manual patch maintenance: as long as .upstream-version accurately records the last applied commit, the merge is reproducible.

Why we deviate from upstream: Upstream Skill Creator writes eval results to <skill-name>-workspace/ as a sibling to the skill folder. We keep them at <skill-dir>/evals/workspace/iteration-N/ so eval artifacts stay adjacent to the skill they belong to.

Running evals step-by-step

Invoke: In a Claude Code session, ask Claude to evaluate the skill:

"Run evals on the test-app skill" or "Use the skill creator to evaluate and improve my X skill"

Claude picks up the skill-creator and guides the process. The skill-creator's evals/evals.json format uses evals[] with id, prompt, expected_output, and expectations. We adopted this format across all our skills.
What happens: Two parallel subagents run each eval:
- With-skill agent: runs the scenario with the skill active
- Baseline agent: runs the same scenario without the skill Results are saved to <skill-dir>/evals/workspace/iteration-N/eval-<name>/ (inside the skill folder, per our local convention).
Review results: The Skill Creator runs eval-viewer/generate_review.py and opens a browser tab (or generates a static HTML file with --static) with two tabs: Outputs (click through each eval, leave qualitative feedback) and Benchmark (pass rates, timing, tokens with-skill vs baseline).
Output files written to <skill-dir>/evals/workspace/iteration-N/eval-<name>/:
- grading.json — assertion pass/fail with evidence per expectation
- timing.json — token count and duration
- benchmark.json — aggregate stats across all evals (one level up, at iteration-N/)
- eval-review-iteration-N.html — static viewer (one level up, at evals/workspace/)
Update last_validated and baseline_score in evals.json after a successful run:
```
"last_validated": "2026-04-13",
"baseline_score": 0.67
```
last_validated unlocks the L5 green circle in the skill overview dashboard. baseline_score records the with-skill pass rate at the time of validation — used as a regression guardrail when re-running evals (see below).
Improve cycle: The Skill Creator's analyzer flags non-discriminating assertions, flaky evals, and skill improvement suggestions. Update SKILL.md and re-run as iteration-2 (and so on, in the same evals/workspace/ folder).

`baseline_score` — Regression Detection

Even running evals manually (no CI), baseline_score is useful: it's the with-skill pass rate from the most recent successful eval run, recorded in evals/evals.json next to last_validated. When you re-run evals later, compare the new pass rate against baseline_score:

New rate >= baseline_score — skill is stable or improving. Update baseline_score to the new rate (and bump last_validated).
New rate < baseline_score — regression. Investigate before updating either field. The skill change you just made may have broken something.

This gives you a paper trail of "this skill scored 0.67 on these expectations on this date" without needing CI infrastructure. The Skill Creator's own benchmark.json already produces the pass rate — you're just writing it back into evals.json as a durable marker.

Eval Workspace Layout

The workspace is bounded to 3 live iterations max. Older iterations are summarized into archived-iterations/ automatically at the end of every eval cycle (Step 6 in references/eval-runner.md).

<skill-dir>/
  evals/
    evals.json                          # eval definitions + trigger tests
    workspace/
      archived-iterations/              # sorts above iteration-N alphabetically
        INDEX.md                        # one-line-per-iteration summary table
        index.json                      # machine-readable equivalent
        iteration-1/
          summary.json                  # pass rates, tokens, durations, failed assertions
          benchmark.json                # preserved verbatim
          benchmark.md
          embedded_data.json            # extracted from HTML viewer; re-render on demand
          notable/                      # full eval dirs ONLY for regressions
            eval-some-name/...          # OR same_as.json pointer if dedup'd vs earlier archive
      iteration-5/                      # live (full data)
        eval-basic-usage/
          eval_metadata.json
          with_skill/run-1/{grading,timing}.json + outputs/
          without_skill/run-1/{grading,timing}.json + outputs/
        eval-edge-case/...
        benchmark.json
      iteration-6/...                   # live
      iteration-7/...                   # live (current)
      eval-review-iteration-7.html      # static viewer for current iteration

Rotation rules (run by python -m scripts.rotate_workspace <workspace> --keep 3 at end of each eval cycle):

Iterations beyond the keep window are archived into archived-iterations/iteration-N/ and removed from the live tree.
Iteration numbering keeps incrementing forever — iteration-8, iteration-9 — never reused.
"Notable" detection (which evals get full output preserved in notable/): a regression vs. the previous iteration (assertion that passed before now fails, OR with_skill pass_rate dropped). For iteration 1 the fallback is "any failed assertion".
Dedup across archives: if an earlier archived iteration already preserved the same eval with an identical with_skill failed-assertion signature, a new notable entry is replaced by a same_as.json pointer to that earlier iteration — no duplicated outputs.
The rendered HTML viewer is NOT preserved as HTML — only its EMBEDDED_DATA JSON is. To inspect an archived iteration in the browser, feed embedded_data.json back through eval-viewer/generate_review.py.

Typical archived iteration size: ~30-50 KB (vs. ~250 KB for a live one).

evals/evals.json Format​

Using the Anthropic Skill Creator​

Installation​

Keeping it up to date​

Running evals step-by-step​

baseline_score — Regression Detection​

Eval Workspace Layout​