name: skill-creator description: Use when creating new Agent Skills, upgrading existing skills, running evals to test a skill, benchmarking skill performance, or optimizing a skill's description for better triggering accuracy. Guidelines for Gold Standard skill structures. tier: 2 version: 2.0
Skill Creator Guide
This skill provides the authoritative standard for creating and iteratively improving Agent Skills. It combines the Anthropic Skills Standard with our local architecture rules.
Core loop: Draft skill → Write test cases → Run evals (with-skill + baseline) → Review with user → Improve → Repeat.
Your job is to figure out where the user is in this process and help them progress. Maybe they want to create a skill from scratch — help narrow intent, write a draft, create tests, run them, iterate. Maybe they already have a draft — go straight to eval/iterate. Be flexible.
Red Flags (Anti-Rationalization)
STOP and READ THIS if you are thinking:
- "I'll skip the eval step, the skill looks fine" → WRONG. Run evals — untested skills fail silently in production.
- "I can write the whole skill without talking to the user" → WRONG. Capture Intent first — assumptions cause rewrites.
- "The description is descriptive enough" → WRONG. CSO triggers are mechanical. Follow the schema.
- "This skill is too simple for a script" → WRONG. If logic > 5 lines, text instructions fail 30% of the time. Use a script.
- "I'll skip the viewer and evaluate outputs myself" → WRONG. Generate the eval viewer BEFORE evaluating — get results in front of the human ASAP.
Purpose
Enable agents to create, test, and iteratively improve Agent Skills following the Gold Standard. This skill provides both the quality standards (what makes a good skill) and the workflow engine (how to iterate to get there).
Capabilities
- Create new skills from scratch with validated structure
- Run structured evals with baseline comparison (with-skill vs without-skill)
- Grade, benchmark, and review eval results via interactive viewer
- Iteratively improve skills based on user feedback
- Optimize skill descriptions for better triggering accuracy
- Package skills into distributable
.skillfiles - Adapt workflow to different environments (Claude Code, Codex, Antigravity, Claude.ai, Cowork)
1. Quality Standards (Summary)
Full details:
references/writing_skills_best_practices_anthropic.mdDesign patterns:references/skill_design_patterns.md
Skill Anatomy
Every skill follows this directory structure:
skill-name/
├── SKILL.md (Required — YAML frontmatter + markdown body)
├── examples/ # Few-shot training: input/output pairs
├── assets/ # User output: templates, files for output
├── references/ # Agent knowledge: specs, guidelines, schemas
├── scripts/ # Executable logic: Python/Bash tools
└── eval-viewer/ # (Optional) Interactive eval review UI
All subdirectories are optional. Do NOT create README.md, CHANGELOG.md, or other aux docs inside the skill folder — all instructions go in SKILL.md.
Key Rules
- Script-First: If a step requires >5 lines of if/then/else logic, it MUST be a Python script in
scripts/— agents are unreliable at executing complex logic from text. - 12-Line Rule: Inline code blocks, templates, or examples >12 lines MUST be extracted to
examples/,assets/, orreferences/. - Graduated Language: Skills must work across LLMs (Claude, Gemini, Codex, Qwen, Llama). Use graduated instruction strength:
- Safety-critical:
MUST/ALWAYS+ explain why - Behavioral: Explain why + imperative verb
- Prohibited:
MUST NOT+ consequence
- Safety-critical:
- CSO (Search Optimization): The
descriptionfield determines if a skill is loaded. Start withUse when...(preferred),Guidelines for...,Helps with...,Standards for..., orDefines.... Keep under 50 words. Make descriptions "pushy" to prevent under-triggering. - Red Flags: Every skill MUST include a "Red Flags" section to prevent agent rationalization.
- Naming: Use gerund form
verb-ing-noun(e.g.,processing-pdfs). Always lowercase kebab-case.
Frontmatter
---
name: skill-my-capability
description: "Use when..."
tier: [0|1|2]
version: 1.0
---
Run python3 scripts/init_skill.py --help to see available Tiers. Do NOT guess tiers manually.
Required Sections in SKILL.md
- Purpose — the "Why"
- Red Flags — "Stop and Rethink" triggers
- Capabilities — bulleted list
- Execution Mode —
prompt-first,script-first, orhybrid - Script Contract — required for script-first/hybrid (command, inputs, outputs, exit codes)
- Safety Boundaries — explicit scope and exclusions
- Validation Evidence — objective verification output
- Instructions — step-by-step algorithms
- Examples — input/output pairs (see
examples/SKILL_EXAMPLE_LEGACY_MIGRATOR.md)
Template: assets/SKILL_TEMPLATE.md
Execution Mode
- Mode:
hybrid - Rationale: Skill authoring needs judgement for structure/quality decisions and deterministic scripts for repeatable validation and generation.
Script Contract
- Primary Commands:
python3 scripts/init_skill.py <name> --tier <N>— generate skill skeletonpython3 scripts/validate_skill.py <skill-path>— validate structure and compliancepython3 scripts/package_skill.py <skill-path>— package into.skillfile
- Inputs: skill name/path, tier, and local policy config.
- Outputs: generated skeletons, validation pass/fail, warnings,
.skillarchives. - Failure Semantics: non-zero exit code on validation errors.
Safety Boundaries
- Operate only on explicit target skill directories.
- No broad or implicit repo-wide mutation.
- Destructive actions never default; require explicit user intent.
Validation Evidence
validate_skill.pyoutput and generated diffs.- Quality checks: section coverage, CSO, inline efficiency, metadata.
2. Creating a Skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow they want to capture. If so, extract answers from conversation history first — tools used, sequence of steps, corrections made, I/O formats observed.
- What should this skill enable the agent to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases? Skills with objectively verifiable outputs (file transforms, data extraction, code generation) benefit from tests. Subjective skills (writing style, art) often don't. Suggest the appropriate default, let the user decide.
Interview and Research
Proactively ask about edge cases, I/O formats, example files, success criteria, dependencies. Wait to write test prompts until this is ironed out.
Check available MCPs — if useful for research, research in parallel via subagents if available, otherwise inline.
Write the SKILL.md
- Check Duplicates: Verify in your Skill Catalog.
- Initialize:
python3 scripts/init_skill.py my-new-skill --tier 2 - Populate: Edit the auto-generated
SKILL.md. Fill in Red Flags, description, Execution Mode, Script Contract, Safety Boundaries, Validation Evidence. Consultreferences/skill_design_patterns.mdandreferences/writing_skills_best_practices_anthropic.md. - Cleanup: Remove unused placeholder files/directories created by init.
- Validate:
python3 scripts/validate_skill.py ../my-new-skill
Test Cases
After the draft, write 2-3 realistic test prompts — things a real user would actually say. Share them with the user for confirmation, then run them.
Save to evals/evals.json. Don't write assertions yet — just prompts. You'll draft assertions while runs are in progress.
{"skill_name": "my-skill", "evals": [
{"id": 1, "prompt": "Realistic user prompt", "files": [],
"expectations": ["Verifiable outcome 1", "Verifiable outcome 2"]}
]}
See references/eval_schemas.md for the full schema.
3. Running and Evaluating Test Cases
This section is one continuous sequence — don't stop partway through.
Put results in <skill-name>-workspace/ as a sibling to the skill directory. Organize by iteration (iteration-1/, iteration-2/) and within that, each test case gets a directory.
Step 1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two subagents in the same turn — one with the skill, one without. Launch everything at once so it all finishes around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
Baseline run (depends on context):
- New skill: no skill at all → save to
without_skill/outputs/ - Improving existing skill: snapshot the old version first (
cp -r), point baseline at snapshot → save toold_skill/outputs/
Write eval_metadata.json for each test case with a descriptive name.
Step 2: While runs are in progress, draft assertions
Don't wait — use this time to draft quantitative assertions. Good assertions are objectively verifiable with descriptive names. Don't force assertions onto subjective outputs.
Update eval_metadata.json and evals/evals.json with assertions.
Step 3: As runs complete, capture timing data
When each subagent completes, save total_tokens and duration_ms to timing.json in the run directory. This is the only opportunity to capture this data.
Step 4: Grade, aggregate, and launch the viewer
-
Grade each run — spawn a grader subagent reading
agents/grader.md. Savegrading.jsonwith fieldstext,passed,evidence. For programmatic assertions, write and run a script. -
Aggregate into benchmark:
python3 scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>Produces
benchmark.jsonandbenchmark.md. -
Analyst pass — read benchmark data, surface patterns (see
agents/analyzer.md). -
Launch the viewer:
nohup python3 eval-viewer/generate_review.py \ <workspace>/iteration-N \ --skill-name "my-skill" \ --benchmark <workspace>/iteration-N/benchmark.json \ > /dev/null 2>&1 & VIEWER_PID=$!For iteration 2+, add
--previous-workspace <workspace>/iteration-<N-1>. -
Tell the user: "I've opened the results in your browser. 'Outputs' tab shows test cases with feedback boxes. 'Benchmark' tab shows quantitative comparison. Come back when you're done."
Step 5: Read the feedback
When the user is done, read feedback.json. Empty feedback = looks good. Focus improvements on test cases with specific complaints.
kill $VIEWER_PID 2>/dev/null
4. Improving the Skill
How to think about improvements
-
Generalize from feedback. You're iterating on a few examples to create a skill used many times. Rather than overfitty changes or oppressively constrictive MUSTs, try different metaphors or recommend different patterns.
-
Keep the prompt lean. Remove things not pulling their weight. Read transcripts — if the skill wastes time on unproductive steps, trim those instructions.
-
Explain the why. Today's LLMs are smart. When given good harness they go beyond rote instructions. If you find yourself writing ALWAYS/NEVER in all caps, reframe with reasoning. That's more powerful and effective.
-
Look for repeated work across test cases. Read transcripts — if all subagents independently wrote similar helper scripts, that script should be bundled in
scripts/.
The iteration loop
- Apply improvements
- Rerun all test cases into
iteration-<N+1>/, including baselines - Launch viewer with
--previous-workspace - Wait for user review
- Read feedback, improve, repeat
Stop when: user is happy, feedback is all empty, or no meaningful progress.
5. Description Optimization
After the skill is working well, optimize the description for better triggering accuracy.
Step 1: Generate trigger eval queries
Create ~20 eval queries — mix of should-trigger and should-not-trigger. Save as JSON:
[{"query": "realistic user prompt with details", "should_trigger": true}]
Queries must be realistic — include file paths, personal context, abbreviations, typos, casual speech. Focus on edge cases: near-misses that share keywords but need different skills.
Step 2: Review with user
Present eval set using the HTML template:
- Read
assets/eval_review.html - Replace
__EVAL_DATA_PLACEHOLDER__,__SKILL_NAME_PLACEHOLDER__,__SKILL_DESCRIPTION_PLACEHOLDER__ - Write to temp file and open it
- User edits queries, exports as
eval_set.json
Step 3: Run the optimization loop
Dependencies: Requires
pip install anthropicandclaudeCLI installed. Claude Code specific — see Section 6 for other environments.
python3 scripts/run_loop.py \
--eval-set <path-to-eval-set.json> \
--skill-path <path-to-skill> \
--model <model-id-powering-this-session> \
--max-iterations 5 --verbose
This splits eval set into 60% train / 40% test, runs each query 3 times, uses Claude with extended thinking to propose improvements, iterates up to 5 times, selects by test score to avoid overfitting.
Step 4: Apply the result
Take best_description from output, update the skill's frontmatter. Show before/after and scores.
6. Environment Adaptations
Claude Code (full workflow)
All features work: subagents for parallel test runs, viewer via browser, claude -p for description optimization, package_skill.py for packaging.
Codex
Codex supports subagents. Adapt:
- Description optimization:
run_loop.pyusesclaude -pwhich is Claude Code specific. Skip this step or run it manually. - Viewer: If no browser, use
--static <output_path>for standalone HTML.
Antigravity IDE
Full workflow available. Follow the standard process. If subagents are available, use parallel test runs. Otherwise, run tests sequentially.
Claude.ai (no subagents)
- Running tests: No subagents. Read the skill's SKILL.md, then follow its instructions yourself, one test at a time. Skip baseline runs.
- Reviewing: Skip browser viewer. Present results directly in conversation. Save output files and tell user where they are.
- Benchmarking: Skip quantitative benchmarking.
- Description optimization: Requires
claudeCLI. Skip on Claude.ai. - Packaging:
package_skill.pyworks anywhere with Python.
Cowork (headless)
- Subagents work. If timeouts occur, run tests sequentially.
- No browser — use
--static <output_path>for viewer. User clicks link to open HTML. - Feedback: "Submit All Reviews" downloads
feedback.jsonas file. - GENERATE THE EVAL VIEWER BEFORE evaluating outputs yourself — get results in front of the human ASAP.
7. Advanced: Blind Comparison
For rigorous A/B comparison between skill versions, read agents/comparator.md and agents/analyzer.md. Give two outputs to an independent agent without telling it which is which, let it judge quality.
Optional. Most users won't need it — the human review loop is usually sufficient.
8. Package and Present
If the present_files tool is available:
python3 scripts/package_skill.py <path/to/skill-folder>
Direct the user to the resulting .skill file.
9. Scripts & Tools Reference
scripts/ — Core automation
init_skill.py: Generate compliant skill skeletonvalidate_skill.py: Enforce structure, frontmatter, CSO, execution-policyskill_utils.py: Config loader (defaults + project overlay) +parse_skill_md()aggregate_benchmark.py: Compute benchmark summary fromgrading.jsonfilesgenerate_report.py: Build static HTML report frombenchmark.jsonrun_eval.py: Run trigger evaluation queries via CLIrun_loop.py: Main eval + improvement loop for description optimizationimprove_description.py: Improve description using Claude with extended thinkingpackage_skill.py: Package skill into.skill(ZIP) file
eval-viewer/ — Interactive eval review
generate_review.py: Launch interactive eval viewer (servesviewer.html)viewer.html: Benchmark viewer with Outputs + Benchmark tabs
10. Resources Reference
references/writing_skills_best_practices_anthropic.md— Full authoring guide (Anthropic standard)references/skill_design_patterns.md— Degrees of Freedom, Progressive Disclosure, EDDreferences/default_parameters.md— Config defaults and resolution orderreferences/eval_schemas.md— JSON schemas for evals, grading, comparison, analysisreferences/output-patterns.md— Templates for agent output formatsreferences/workflows.md— Designing skill-internal workflowsreferences/persuasion-principles.md— Psychological principles for instructionsreferences/testing-skills-with-subagents.md— TDD methodology for skillsagents/grader.md— Evaluate assertions against outputsagents/comparator.md— Blind A/B comparisonagents/analyzer.md— Post-hoc analysis of results

