Agent & Skill Quality Audit

Consolidated Scorecard & Findings
Date: February 20, 2026 | Audited: 25 definitions (6 agents, 4 skill-embedded, 15 skills) | Method: agent-auditor rubric (6 dimensions, max 60)

Executive Summary

This audit scores 25 agent and skill definitions across 6 dimensions using the agent-auditor rubric. Of the 21 definitions scored on the full /60 scale, two earned A grades (code-reviewer, auto-error-resolver), thirteen scored B, six scored C, and none scored D; the four skill-embedded agents were graded on an adjusted /30 scale. Four critical issues were identified: a runtime-breaking typo in otel-quality-reporting, a security gap in git-commit-smart, a 100% error rate on agent-auditor, and zero telemetry for two OTEL skills.

Grade scale: A = 48–60 | B = 36–47 | C = 24–35 | D = <24

1. Active Agents Scorecard

| Agent | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| code-reviewer | 9 | 9 | 9 | 8 | 7 | 8 | 50 | A |
| auto-error-resolver | 7 | 10 | 10 | 8 | 10 | 6 | 49 | A |
| genai-quality-monitor | 9 | 10 | 10 | 5 | 4 | 9 | 47 | B |
| webscraping-research-analyst | 5 | 10 | 10 | 5 | 9 | 7 | 46 | B |
| code-simplifier | 4 | 10 | 10 | 5 | 8 | 8 | 45 | B |
| agent-auditor | 1 | 8 | 9 | 9 | 0 | 2 | 29 | C |

2. Skill-Embedded Agents

Note: Skill-embedded agents score 0/10 on Telemetry, Alignment, and Efficiency (no separate spans emitted). Adjusted grade uses deterministic dimensions only (/30 scale: A-eq = 24–30, B-eq = 18–23, C-eq = 12–17).
| Agent | Definition | Prompting | Overlap | Det /30 | Adj Grade |
|---|---|---|---|---|---|
| html-auditor | 9 | 8 | 5 | 22 | A-eq |
| code-reviewer (skill) | 9 | 5.5 | 5 | 19.5 | B-eq |
| code-explorer (skill) | 9 | 5 | 5 | 19 | B-eq |
| code-architect (skill) | 9 | 5 | 5 | 19 | B-eq |

3. Skills Scorecard

| Skill | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| content-creator | 8 | 9 | 8 | 8 | 7 | 7 | 47 | B |
| otel-session-summary | 7 | 9 | 9 | 6 | 7 | 8 | 46 | B |
| session-backlog | 4 | 8 | 9 | 8 | 7 | 8 | 44 | B |
| otel-quality-reporting | 8 | 8 | 9 | 5 | 8 | 5 | 43 | B |
| backlog-implementer | 4 | 7 | 9 | 7 | 7 | 7 | 41 | B |
| bug-detective | 7 | 6 | 8 | 7 | 7 | 6 | 41 | B |
| git-commit-smart | 7 | 5 | 4 | 8 | 8 | 7 | 39 | B |
| frontend-design | 7 | 4 | 6 | 8 | 8 | 5 | 38 | B |
| ralph-wiggum | 8 | 5 | 5 | 8 | 6 | 5 | 37 | B |
| cicd-cross-platform | 6 | 6 | 6 | 8 | 5 | 5 | 36 | B |
| otel-improvement | 0 | 10 | 10 | 7 | 0 | 8 | 35 | C |
| codebase-analyzer | 5 | 5 | 6 | 7 | 6 | 6 | 35 | C |
| feature-dev | 6 | 5 | 6 | 6 | 5 | 6 | 34 | C |
| agent-improvement | 1 | 9 | 8 | 7 | 1 | 7 | 33 | C |
| otel-output-provenance | 0 | 9 | 9 | 6 | 0 | 7 | 31 | C |

4. Grade Distribution

| Grade | Count | Definitions |
|---|---|---|
| A | 2 | code-reviewer (50), auto-error-resolver (49) |
| B | 13 | genai-quality-monitor (47), content-creator (47), webscraping-research-analyst (46), otel-session-summary (46), code-simplifier (45), session-backlog (44), otel-quality-reporting (43), backlog-implementer (41), bug-detective (41), git-commit-smart (39), frontend-design (38), ralph-wiggum (37), cicd-cross-platform (36) |
| C | 6 | otel-improvement (35), codebase-analyzer (35), feature-dev (34), agent-improvement (33), otel-output-provenance (31), agent-auditor (29) |
| D | 0 | — |

Skill-Embedded Agents (Adjusted /30 Scale)

| Adj Grade | Count | Definitions |
|---|---|---|
| A-eq | 1 | html-auditor (22/30) |
| B-eq | 3 | code-reviewer-skill (19.5/30), code-explorer (19/30), code-architect (19/30) |

5. Scoring Dimensions Reference

| # | Dimension | Max | Method |
|---|---|---|---|
| 1 | Telemetry Health | 10 | Usage frequency (35%), error rate (25%), trend (20%), session diversity (20%) from OTEL traces |
| 2 | Definition Quality | 10 | 10-point checklist: name, description, tools, model, description length, ## sections, ## When, output format, line count 30–200, tools restricted |
| 3 | Prompt Engineering | 10 | Role statement (1), numbered steps (1.5), guardrails (1), examples (1), tables (1), output spec (1.5), markdown structure (1), scope boundaries (1) |
| 4 | Overlap & Redundancy | 10 | Jaccard similarity on tool sets and description keywords; flag if tool-Jaccard >0.8 AND keyword-Jaccard >0.5 |
| 5 | Usage Alignment | 10 | Match expected category from description vs actual agent.category / plugin.category in OTEL |
| 6 | Efficiency & Cost | 10 | Duration percentile (30%), retry/error amplification (25%), output density (25%), background appropriateness (20%) |
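Dimension 4 reduces to two Jaccard computations and an AND over the two thresholds. A minimal Python sketch (the tool and keyword sets below are hypothetical, not taken from any audited definition):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical tool sets and description keywords for two definitions.
tools_a = {"Read", "Grep", "Glob", "Bash"}
tools_b = {"Read", "Grep", "Glob", "Edit"}
keywords_a = {"review", "quality", "lint"}
keywords_b = {"refactor", "quality", "cleanup"}

tool_j = jaccard(tools_a, tools_b)            # 3 shared / 5 total = 0.6
keyword_j = jaccard(keywords_a, keywords_b)   # 1 shared / 5 total = 0.2

# Flag only when BOTH thresholds are exceeded, per the rubric.
flagged = tool_j > 0.8 and keyword_j > 0.5    # False here
```

Requiring both thresholds keeps definitions with similar tool sets but distinct vocabularies (a common, legitimate pattern) from being flagged.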

6. Cross-Cutting Findings

Universal Gaps

  • Missing model field: 0/15 skills specify a model in frontmatter — all should add one
  • Missing allowed-tools: 7 skills lack tool restrictions (feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer)
  • No role statements: 0/15 skills open with “You are…” — all use functional/descriptive openings
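All three gaps live in, or immediately after, the skill frontmatter. A minimal sketch of a compliant header, assuming the frontmatter fields this report references (model, allowed-tools); the skill name, description, and tool list are illustrative only:

```yaml
---
name: example-skill                 # illustrative, not one of the audited skills
description: Summarize OTEL session traces into a concise quality report. Use when the user asks for a session summary.
model: claude-sonnet-4-6            # closes the missing-model gap
allowed-tools: Read, Grep, Bash     # closes the missing allowed-tools gap
---

You are a telemetry analyst who distills session traces into concise reports.
```

The "You are…" opening line after the frontmatter addresses the missing-role-statement gap.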

Critical Issues

P0 Runtime failure: otel-quality-reporting has a typo in Phase 1–2 script paths (otel-quality-reportinging — double “ing”) that causes FileNotFoundError.
P0 Security risk: git-commit-smart uses git add -A with no guardrail against staging .env, credentials, or secrets.
P1 100% error rate: agent-auditor has 100% error rate on all 5 observed spans (Feb 21); 307 lines exceeds its own 200-line guideline.
P1 Zero telemetry: otel-improvement and otel-output-provenance have 0 plugin invocations despite references in traces — likely invocation mechanism mismatch.
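For the git-commit-smart gap, one shape a sensitive-file guardrail could take, sketched in Python for illustration (the deny-list patterns are illustrative and not exhaustive; the skill itself would express this as a prompt guardrail rather than code):

```python
import fnmatch
import subprocess

# Illustrative deny-list; a real guardrail should be reviewed and extended.
SENSITIVE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "*credential*", "*secret*"]

def is_sensitive(path: str) -> bool:
    """True if a path (or its basename) matches any deny-list pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch.fnmatch(name, pat) or fnmatch.fnmatch(path, pat)
               for pat in SENSITIVE_PATTERNS)

def safe_add_all() -> None:
    """Run `git add -A` only after screening the worktree for sensitive files."""
    status = subprocess.run(["git", "status", "--porcelain"],
                            capture_output=True, text=True, check=True).stdout
    # Porcelain lines are "XY path"; the path starts at column 3.
    hits = [line[3:].strip() for line in status.splitlines()
            if line and is_sensitive(line[3:].strip())]
    if hits:
        raise SystemExit(f"Refusing to stage potentially sensitive files: {hits}")
    subprocess.run(["git", "add", "-A"], check=True)
```

Screening before staging, rather than after, means a secret never enters the index in the first place.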

Structural Observations

  • Telemetry blind spot: Skill-embedded agents (feature-dev/agents/*, otel-improvement/agents/*) emit no separate spans; 3/6 scoring dimensions are structurally unavailable for them.
  • Overlap cluster: otel-session-summary / otel-quality-reporting / otel-output-provenance share tools and domain; ## When sections need clearer disambiguation.
  • Pattern mismatch: cicd-cross-platform functions as injected reference context (280 keyword matches) rather than an interactive skill (6 explicit invocations) — should be documented accordingly.
  • Declining trends: bug-detective, ralph-wiggum, content-creator, and feature-dev show declining usage in last 7 days.

7. Priority Action Items

| Priority | Action | Target | Impact |
|---|---|---|---|
| P0 | Fix script path typo (otel-quality-reportinging) | otel-quality-reporting | Runtime failures in Phase 1–2 |
| P0 | Add sensitive-file guardrail (never stage .env/credentials) | git-commit-smart | Security risk from git add -A |
| P1 | Add model to all 15 skills | All skills | Routing reliability, cost control |
| P1 | Add allowed-tools to 7 skills missing it | feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer | Tool scope control |
| P1 | Investigate agent-auditor 100% error rate | agent-auditor | Reliability |
| P1 | Trim agent-auditor to <200 lines | agent-auditor | Self-consistency |
| P2 | Investigate zero telemetry for otel-improvement, otel-output-provenance | Invocation mechanism | Observability gap |
| P2 | Add ## When sections to skills missing them | session-backlog, frontend-design, git-commit-smart | Auto-routing accuracy |
| P2 | Add ## When routing to all 4 skill-embedded agents | feature-dev/agents/*, otel-improvement/agents/* | Routing reliability |
| P3 | Add role statements (“You are…”) to all skills | All skills | Prompt engineering consistency |
| P3 | Add numbered workflow steps to ralph-wiggum | ralph-wiggum | Operational clarity |
| P3 | Disambiguate otel-session-summary vs otel-quality-reporting in ## When | Both skills | Reduce routing confusion |

8. A-Grade Agents

code-reviewer — A (50/60)

  • Strengths: Highest usage (322 invocations in February across 20 active days), clean read-only tool set, stable week-over-week trend, low error rate (4%).
  • Issues: Description too terse for reliable auto-routing (“Expert code reviewer. Use proactively after code changes.” — lacks TypeScript/React/Node.js signal); category split between review and code.
  • Action: Expand description to include stack keywords.

auto-error-resolver — A (49/60)

  • Strengths: Perfect definition quality (10/10) and prompt engineering (10/10), 100% category alignment (error-handling), rising usage (4x week-over-week).
  • Issues: 7.1% error rate in the 5–15% penalty band (3 errors on Feb 17); session diversity borderline at ~3 sessions.
  • Action: Investigate Feb 17 error cluster.

9. B-Grade Agents

genai-quality-monitor — B (47/60)

  • Strengths: Rising usage (2x week-over-week), zero errors, domain-specific evaluation tables, Analysis Commands section with real bash queries.
  • Issues: Categorized as general 67% of the time instead of observability — 1 point from A-grade.
  • Action: Add “OTEL” or “observability” to description field for stronger categorizer signal.

webscraping-research-analyst — B (46/60)

  • Strengths: Best output density (14,728 bytes), excellent session diversity (7 sessions), perfect category alignment.
  • Issues: 25% error rate on instrumented spans (Feb 17 error with has_actions:true); declining trend.
  • Action: Investigate Feb 17 error (session 40dc9c66); check if agent crossed into implementation despite guardrails.

code-simplifier — B (45/60)

  • Strengths: Perfect definition and prompting (10/10 each), zero errors, well-scoped Simplification Patterns table.
  • Issues: Near-zero usage (1 confirmed invocation in Jan); high tool Jaccard (0.857) with the legacy code-refactor-agent.
  • Action: Update description to include routing keywords like “cleanup” or “readability.”

10. C-Grade Agent

agent-auditor — C (29/60)

Critical: 100% error rate on all 5 observed invocations; 307 lines exceeds 200-line guideline; miscategorized as security instead of observability; zero usage before Feb 21.
  • Strengths: Exceptional prompt engineering (9/10) with phases, tables, guardrails, and output templates; appropriate background usage.
  • Action: Trim to <200 lines; fix category routing; investigate error root cause.

11. B-Grade Skills

content-creator — B (47/60)

  • Strengths: Highest telemetry among skills (114 spans), comprehensive frontmatter with MCP tools, excellent voice/tone templates.
  • Issues: Missing model; declining trend (4 spans last 7d vs 46 prior).
  • Action: Add model: claude-sonnet-4-6; investigate usage decline.

otel-session-summary — B (46/60)

  • Strengths: Tight 3-tool footprint (Bash, Read, Glob), leanest complete skill, rising usage, clear scope boundaries.
  • Issues: Missing model; overlap with otel-quality-reporting on LLM-as-Judge metrics.
  • Action: Add model: claude-haiku-4-5; add disambiguation note to ## When section.

session-backlog — B (44/60)

  • Strengths: Excellent Rules section, clear deduplication logic, well-specified output format.
  • Issues: Missing model; missing ## When routing section; no role statement.
  • Action: Add model and ## When to Use.

otel-quality-reporting — B (43/60)

P0 Critical typo in script paths (otel-quality-reportinging) causes runtime failure.
  • Strengths: Most complete output specification, quality checklist for self-verification, well-defined publish pipeline.
  • Issues: Description 274 chars (over 200 limit); missing model; overlap with otel-session-summary.
  • Action: Fix script path typo immediately; trim description; add model: claude-sonnet-4-6.

backlog-implementer — B (41/60)

  • Strengths: Best workflow detail (8 sub-steps in Phase 2), unique memory compaction protocol, code review integration per commit.
  • Issues: Missing allowed-tools and model; description slightly over 200 chars.
  • Action: Add allowed-tools and model: claude-sonnet-4-6; trim description.

bug-detective — B (41/60)

  • Strengths: Best DO/DON’T examples, P0–P3 priority matrix, time allocation table, Quick Start entry point.
  • Issues: Missing allowed-tools, model, output section; declining usage; 277 lines near upper bound.
  • Action: Add frontmatter fields; promote “When to use” to ## When to Use heading.

git-commit-smart — B (39/60)

P0 No guardrail against staging secrets with git add -A. Only 28 lines (below 30-line minimum).
  • Strengths: Most-used skill (tied at 18 invocations), minimal tool list, clear commit format spec.
  • Action: Add sensitive-file guardrail; expand to 30+ lines; add model and constraints section.

frontend-design — B (38/60)

  • Strengths: High usage (18 invocations, rising), excellent guardrails against generic AI aesthetics.
  • Issues: Missing allowed-tools, model, ## When; description 215 chars (over limit); no workflow phases.
  • Action: Add light workflow (2–3 phases); add ## When to Use; trim description.

ralph-wiggum — B (37/60)

  • Strengths: Unique Stop-hook feedback loop pattern; clear Good for/Not good for guidance.
  • Issues: Lowest prompt engineering among B-grade (5/10); no role statement, workflow steps, tables, or output spec.
  • Action: Add role statement, numbered iteration steps, output format, allowed-tools, and model.

cicd-cross-platform — B (36/60)

  • Strengths: High-quality code examples; recommendation matrix; wide keyword coverage (280 ambient matches).
  • Issues: Reference document, not workflow skill; no role statement, workflow, allowed-tools, model, or output section.
  • Action: Add brief workflow section or document as reference-injection skill; add model: claude-haiku-4-5.

12. C-Grade Skills

otel-improvement — C (35/60)

  • Strengths: Best-in-class definition (10/10) and prompting (10/10); closed-loop design with state file; grade calculation tables.
  • Issues: Zero telemetry invocations — possible invocation mechanism mismatch; missing model.
  • Action: Investigate why plugin.name: otel-improvement never appears in traces; add model.

codebase-analyzer — C (35/60)

  • Strengths: Good ## When to Use with concrete scenarios; Mermaid diagram generation; safety guidelines.
  • Issues: Missing allowed-tools and model; no role statement; hardcoded code-refactor-agent subagent may not exist; rm -f in Phase 4 without confirmation.
  • Action: Add frontmatter fields; verify subagent name; strengthen Phase 4 guardrails.

feature-dev — C (34/60)

  • Strengths: Logical 7-phase structure; user confirmation gates; “understand before acting” principle.
  • Issues: Missing allowed-tools, model, ## When, output format, examples, tables.
  • Action: Add all missing frontmatter fields and structural sections.

agent-improvement — C (33/60)

  • Strengths: Best-structured prompt among new skills; comprehensive error handling; EMA state persistence.
  • Issues: Brand new (zero telemetry, untracked in git); missing model; no role statement.
  • Action: Add model: claude-sonnet-4-6 and role statement; monitor after first use.

otel-output-provenance — C (31/60)

  • Strengths: Excellent definition completeness (9/10); unique multi-session lineage tracking; threshold mapping table.
  • Issues: Zero telemetry; missing model; significant functional overlap with otel-quality-reporting; 201 lines (1 over limit).
  • Action: Investigate zero invocations; move differentiation statement into description; add model.

13. Appendix: Telemetry Data Sources

  • Trace files: ~/.claude/telemetry/traces-2026-02-*.jsonl (21 files, Feb 1–21)
  • Span types: hook:agent-post-tool, hook:agent-pre-tool, hook:plugin-pre-tool, hook:plugin-post-tool, hook:skill-activation-prompt
  • Extraction: All queries performed via Grep/Read directly on the JSONL files; no Bash pipelines were used
  • Error rate: Derived from agent.has_error / plugin.has_error span attributes
  • Usage alignment: From agent.category / plugin.category in pre-tool spans
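The error-rate derivation above amounts to a single pass over the trace files. A minimal Python illustration, assuming each JSONL line is a span object carrying an attributes map with agent.name and agent.has_error (the exact span schema is an assumption):

```python
import json
from collections import defaultdict
from pathlib import Path

def error_rates(trace_dir: str) -> dict[str, float]:
    """Per-agent error rate from agent.has_error attributes in JSONL traces."""
    totals: dict = defaultdict(lambda: [0, 0])  # agent name -> [errors, spans]
    for path in Path(trace_dir).glob("traces-2026-02-*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            attrs = json.loads(line).get("attributes", {})
            name = attrs.get("agent.name")
            if name is None:          # plugin/hook spans without an agent name
                continue
            counts = totals[name]
            counts[1] += 1
            if attrs.get("agent.has_error"):
                counts[0] += 1
    return {name: errs / total for name, (errs, total) in totals.items()}
```

The same pass with plugin.name / plugin.has_error would cover the skills scorecard.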