Analysis & Verification

test

Execute eval test suites that verify skills actually work when loaded by an agent. Define test cases in cases.yaml with prompts, expected outcomes, and graders. Track baselines to catch regressions after running refresh.

Why it matters

A skill can have perfect metadata and pass every lint check, yet still produce wrong code when an agent uses it. The test command closes this gap by actually running prompts through an agent harness and grading the output: integration tests for your skill files.

What it does

  • Discovers tests/ directories (those containing a cases.yaml) inside each skill directory
  • Parses declarative test suites with trigger, outcome, style, and regression test types
  • Executes prompts through configurable agent harnesses (Claude Code CLI, generic shell)
  • Grades results with 7 built-in graders: file-exists, command, contains, not-contains, json-match, package-has, llm-rubric
  • Supports custom graders via dynamic module import
  • Runs multiple trials per test case with configurable pass thresholds and flaky test detection
  • Stores baselines for regression tracking across skill updates
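A cases.yaml suite for the features above might look roughly like this. The exact schema is not shown on this page, so treat every field name as an assumption; only the test types (trigger, outcome, style, regression) and grader names (file-exists, command, contains, not-contains, json-match, package-has, llm-rubric) are taken from the lists above.

```yaml
# Hypothetical cases.yaml sketch — field names are illustrative, not the
# tool's documented schema. Test types and grader names come from this page.
cases:
  - name: generates-streaming-handler   # assumed field
    type: outcome                       # one of: trigger, outcome, style, regression
    prompt: "Add a streaming chat endpoint using this skill"
    trials: 3                           # multiple runs per case (see above)
    pass_threshold: 0.67                # assumed name for the pass threshold
    graders:
      - grader: file-exists             # built-in grader name from this page
        path: app/api/chat/route.ts     # illustrative path
      - grader: contains
        file: app/api/chat/route.ts
        text: streamText
```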

Usage

npx skills-check test [dir] [options]

Options

Flag                    Description
-s, --skill <name>      Test a specific skill
-t, --type <type>       Filter: trigger, outcome, style, or regression
--agent <name>          Agent harness: claude-code or generic
--trials <n>            Number of runs per test case
--dry                   Preview the test plan without executing
--update-baseline       Save results as the new baseline
--ci                    CI mode with strict exit codes
-f, --format <type>     Output: terminal or json

Examples

Run all tests

npx skills-check test

Test one skill

npx skills-check test -s ai-sdk-core

Outcome tests only

npx skills-check test --type outcome

Preview plan

npx skills-check test --dry

Update baseline

npx skills-check test --update-baseline

CI tip

Run test --ci after refresh to catch regressions. Use --update-baseline on main after verified changes so future PRs compare against the latest known-good results.
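In a GitHub Actions pipeline, that refresh-then-test sequence could look roughly like this. The workflow structure and step layout are illustrative, and the assumption that refresh is invoked as npx skills-check refresh is inferred from this page, not confirmed by it.

```yaml
# Hypothetical CI job — only the skills-check commands come from this page.
jobs:
  skill-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx skills-check refresh      # refresh skill material first (assumed invocation)
      - run: npx skills-check test --ci    # strict exit codes fail the job on regressions
      - if: github.ref == 'refs/heads/main'
        run: npx skills-check test --update-baseline   # record new known-good results on main
```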