# test
Execute eval test suites that verify skills actually work when loaded by an agent. Define test cases in `cases.yaml` with prompts, expected outcomes, and graders. Track baselines to catch regressions after `refresh`.
## Why it matters
A skill can have perfect metadata and pass every lint check, yet still produce wrong code when an agent uses it. `test` closes this gap by actually running prompts through an agent harness and grading the output: integration tests for your skill files.
## What it does
- Discovers `tests/` directories inside skill directories containing a `cases.yaml`
- Parses declarative test suites with `trigger`, `outcome`, `style`, and `regression` test types
- Executes prompts through configurable agent harnesses (Claude Code CLI, generic shell)
- Grades results with 7 built-in graders: `file-exists`, `command`, `contains`, `not-contains`, `json-match`, `package-has`, `llm-rubric`
- Supports custom graders via dynamic module import
- Runs multiple trials per test case, with configurable pass thresholds and flaky-test detection
- Stores baselines for regression tracking across skill updates
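A suite combining these pieces could look like the sketch below. The field names (`suite`, `cases`, `graders`, `trials`) and the skill name are illustrative assumptions based on the test types and graders listed above, not a verbatim schema:

```yaml
# Illustrative cases.yaml sketch — key names are assumptions, not the exact schema.
suite: ai-sdk-core
cases:
  - name: generates-streaming-route
    type: outcome
    prompt: "Add a streaming chat route using the AI SDK"
    graders:
      - grader: file-exists
        path: app/api/chat/route.ts
      - grader: contains
        file: app/api/chat/route.ts
        text: streamText
  - name: skill-fires-on-streaming-question
    type: trigger
    prompt: "How do I stream model output to the client?"
    trials: 3
    graders:
      - grader: llm-rubric
        rubric: "Response should use the streaming helpers the skill documents."
```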
## Usage

```shell
npx skills-check test [dir] [options]
```

### Options
| Flag | Description |
|---|---|
| `-s, --skill <name>` | Test a specific skill |
| `-t, --type <type>` | Filter by type: `trigger`, `outcome`, `style`, or `regression` |
| `--agent <name>` | Agent harness: `claude-code` or `generic` |
| `--trials <n>` | Number of runs per test case |
| `--dry` | Preview the test plan without executing |
| `--update-baseline` | Save results as the new baseline |
| `--ci` | CI mode with strict exit codes |
| `-f, --format <type>` | Output format: `terminal` or `json` |
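The built-in graders cover most checks; for anything else, custom graders load via dynamic module import. The exact grader interface is not documented on this page, so the argument and return shapes below are assumptions. In a real project this function would be the module's default export:

```javascript
// Hypothetical custom grader — argument/return shapes are assumptions,
// not the documented skills-check interface. The runner would load this
// module via dynamic import() and call the exported function per test case.
async function semverGrader({ output }) {
  // Pass when the agent's output contains a semver-style version string.
  const ok = /\b\d+\.\d+\.\d+\b/.test(output);
  return {
    pass: ok,
    reason: ok ? 'found a semver version' : 'no semver version in output',
  };
}

// Quick local check of the grader logic:
semverGrader({ output: 'Bumped package to 2.1.0' }).then((r) => console.log(r.pass)); // true
```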
## Examples
Run all tests:

```shell
npx skills-check test
```

Test one skill:

```shell
npx skills-check test -s ai-sdk-core
```

Outcome tests only:

```shell
npx skills-check test --type outcome
```

Preview the plan:

```shell
npx skills-check test --dry
```

Update the baseline:

```shell
npx skills-check test --update-baseline
```

**CI tip:** Run `test --ci` after a `refresh` to catch regressions. Use `--update-baseline` on `main` after verified changes so future PRs compare against the latest known-good results.
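Wired into a pipeline, that tip might look like the step below. This is a sketch assuming GitHub Actions; the step name and the `refresh` invocation are illustrative, and only the `test --ci` command comes from this page:

```yaml
# Illustrative GitHub Actions step — workflow shape and refresh usage are assumptions.
- name: Refresh skills and check for regressions
  run: |
    npx skills-check refresh
    npx skills-check test --ci --format json > skills-test-report.json
```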