Executive Summary
This audit scores 25 agent and skill definitions across 6 dimensions using the agent-auditor rubric. Two definitions earned A grades (code-reviewer, auto-error-resolver), thirteen scored B, six scored C, and none scored D. Four critical issues were identified: a runtime-breaking typo in otel-quality-reporting, a security gap in git-commit-smart, a 100% error rate on agent-auditor, and zero telemetry for two OTEL skills.
Grade scale: A = 48–60 | B = 36–47 | C = 24–35 | D = <24
1. Active Agents Scorecard
| Agent | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| code-reviewer | 9 | 9 | 9 | 8 | 7 | 8 | 50 | A |
| auto-error-resolver | 7 | 10 | 10 | 8 | 10 | 6 | 49 | A |
| genai-quality-monitor | 9 | 10 | 10 | 5 | 4 | 9 | 47 | B |
| webscraping-research-analyst | 5 | 10 | 10 | 5 | 9 | 7 | 46 | B |
| code-simplifier | 4 | 10 | 10 | 5 | 8 | 8 | 45 | B |
| agent-auditor | 1 | 8 | 9 | 9 | 0 | 2 | 29 | C |
2. Skill-Embedded Agents
Note: Skill-embedded agents score 0/10 on Telemetry, Alignment, and Efficiency (no separate spans emitted). Adjusted grade uses deterministic dimensions only (/30 scale: A-eq = 24–30, B-eq = 18–23, C-eq = 12–17).
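The adjusted bands are simply the main 60-point bands halved. A minimal sketch of the mapping (the below-C label `D-eq` is an assumption; the note above stops at C-eq):

```python
def adjusted_grade(det_score: float) -> str:
    """Map a /30 deterministic score to the adjusted grade bands."""
    if det_score >= 24:
        return "A-eq"
    if det_score >= 18:
        return "B-eq"
    if det_score >= 12:
        return "C-eq"
    return "D-eq"  # assumed label for scores below the C-eq band
```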
| Agent | Definition | Prompting | Overlap | Det /30 | Adj Grade |
|---|---|---|---|---|---|
| html-auditor | 9 | 8 | 5 | 22 | B-eq |
| code-reviewer (skill) | 9 | 5.5 | 5 | 19.5 | B-eq |
| code-explorer (skill) | 9 | 5 | 5 | 19 | B-eq |
| code-architect (skill) | 9 | 5 | 5 | 19 | B-eq |
3. Skills Scorecard
| Skill | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| content-creator | 8 | 9 | 8 | 8 | 7 | 7 | 47 | B |
| otel-session-summary | 7 | 9 | 9 | 6 | 7 | 8 | 46 | B |
| session-backlog | 4 | 8 | 9 | 8 | 7 | 8 | 44 | B |
| otel-quality-reporting | 8 | 8 | 9 | 5 | 8 | 5 | 43 | B |
| backlog-implementer | 4 | 7 | 9 | 7 | 7 | 7 | 41 | B |
| bug-detective | 7 | 6 | 8 | 7 | 7 | 6 | 41 | B |
| git-commit-smart | 7 | 5 | 4 | 8 | 8 | 7 | 39 | B |
| frontend-design | 7 | 4 | 6 | 8 | 8 | 5 | 38 | B |
| ralph-wiggum | 8 | 5 | 5 | 8 | 6 | 5 | 37 | B |
| cicd-cross-platform | 6 | 6 | 6 | 8 | 5 | 5 | 36 | B |
| otel-improvement | 0 | 10 | 10 | 7 | 0 | 8 | 35 | C |
| codebase-analyzer | 5 | 5 | 6 | 7 | 6 | 6 | 35 | C |
| feature-dev | 6 | 5 | 6 | 6 | 5 | 6 | 34 | C |
| agent-improvement | 1 | 9 | 8 | 7 | 1 | 7 | 33 | C |
| otel-output-provenance | 0 | 9 | 9 | 6 | 0 | 7 | 31 | C |
4. Grade Distribution
| Grade | Count | Definitions |
|---|---|---|
| A | 2 | code-reviewer (50), auto-error-resolver (49) |
| B | 13 | genai-quality-monitor (47), content-creator (47), webscraping-research-analyst (46), otel-session-summary (46), code-simplifier (45), session-backlog (44), otel-quality-reporting (43), backlog-implementer (41), bug-detective (41), git-commit-smart (39), frontend-design (38), ralph-wiggum (37), cicd-cross-platform (36) |
| C | 6 | otel-improvement (35), codebase-analyzer (35), feature-dev (34), agent-improvement (33), otel-output-provenance (31), agent-auditor (29) |
| D | 0 | — |
Skill-Embedded Agents (Adjusted /30 Scale)
| Adj Grade | Count | Definitions |
|---|---|---|
| A-eq | 0 | — |
| B-eq | 4 | html-auditor (22/30), code-reviewer-skill (19.5/30), code-explorer (19/30), code-architect (19/30) |
5. Scoring Dimensions Reference
| # | Dimension | Max | Method |
|---|---|---|---|
| 1 | Telemetry Health | 10 | Usage frequency (35%), error rate (25%), trend (20%), session diversity (20%) from OTEL traces |
| 2 | Definition Quality | 10 | 10-point checklist: name, description, tools, model, description length, ## sections, ## When, output format, line count 30–200, tools restricted |
| 3 | Prompt Engineering | 10 | Role statement (1), numbered steps (1.5), guardrails (1), examples (1), tables (1), output spec (1.5), markdown structure (1), scope boundaries (1) |
| 4 | Overlap & Redundancy | 10 | Jaccard similarity on tool sets and description keywords; flag if tool-Jaccard >0.8 AND keyword-Jaccard >0.5 |
| 5 | Usage Alignment | 10 | Match expected category from description vs actual agent.category / plugin.category in OTEL |
| 6 | Efficiency & Cost | 10 | Duration percentile (30%), retry/error amplification (25%), output density (25%), background appropriateness (20%) |
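Dimension 4's redundancy check can be sketched in a few lines of Python. The thresholds come from the rubric above; the example tool sets are assumptions chosen to reproduce the 0.857 tool-Jaccard this report cites for code-simplifier, not the audited definitions' actual tool lists:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def redundant(tools_a, tools_b, kw_a, kw_b):
    """Rubric flag: tool-Jaccard > 0.8 AND keyword-Jaccard > 0.5."""
    return (jaccard(set(tools_a), set(tools_b)) > 0.8
            and jaccard(set(kw_a), set(kw_b)) > 0.5)

# Illustrative tool sets (hypothetical, not taken from the definitions):
a = {"Read", "Grep", "Glob", "Edit", "Write", "Bash"}
b = {"Read", "Grep", "Glob", "Edit", "Write", "Bash", "MultiEdit"}
print(round(jaccard(a, b), 3))  # 6 shared / 7 in union → 0.857
```

Note that both conditions must hold: a pair can share nearly identical tool sets yet escape the flag when their description keywords diverge.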
6. Cross-Cutting Findings
Universal Gaps
- Missing `model` field: 0/15 skills specify a model in frontmatter; all should add one
- Missing `allowed-tools`: 7 skills lack tool restrictions (feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer)
- No role statements: 0/15 skills open with “You are…”; all use functional/descriptive openings
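All three universal gaps live in a skill file's frontmatter and opening lines. A minimal sketch of a header that would pass these checks (field names follow the conventions the audit tests for; the name, model, and tool values are illustrative assumptions, not a prescribed template):

```markdown
---
name: example-skill
description: One-sentence routing description under 200 characters, with stack keywords.
model: claude-sonnet-4-6
allowed-tools: Read, Grep, Glob, Bash
---

You are an expert <role>.

## When to Use
- When the user asks for <specific trigger>.
- Not for <out-of-scope task>; route to <other-skill> instead.
```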
Critical Issues
- P0 Runtime failure: otel-quality-reporting has a typo in its Phase 1–2 script paths (`otel-quality-reportinging`, a doubled “ing”) that causes a FileNotFoundError.
- P0 Security risk: git-commit-smart uses `git add -A` with no guardrail against staging .env files, credentials, or secrets.
- P1 100% error rate: agent-auditor errored on all 5 observed spans (Feb 21); at 307 lines it also exceeds its own 200-line guideline.
- P1 Zero telemetry: otel-improvement and otel-output-provenance have 0 plugin invocations despite references in traces; the likely cause is an invocation-mechanism mismatch.
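The git-commit-smart gap could be closed with a deny-list check before any `git add -A`. A minimal sketch (the patterns and function name are hypothetical; a real skill would feed it the paths reported by `git status --porcelain` and abort if anything matches):

```python
import re

# Hypothetical deny-list of sensitive-path patterns; illustrative, not exhaustive.
SENSITIVE_PATTERNS = [
    r"(^|/)\.env(\.|$)",   # .env, .env.local, src/.env, ...
    r"(^|/)credentials",   # credentials files or directories
    r"\.pem$",             # certificates / private keys
    r"\.key$",
    r"(^|/)id_rsa",        # SSH private keys
]

def sensitive_paths(paths):
    """Return the subset of paths matching any sensitive pattern."""
    return [p for p in paths
            if any(re.search(pat, p) for pat in SENSITIVE_PATTERNS)]
```

If `sensitive_paths()` returns anything, the skill should refuse to stage and ask the user instead of running `git add -A`.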
Structural Observations
- Telemetry blind spot: Skill-embedded agents (feature-dev/agents/*, otel-improvement/agents/*) emit no separate spans; 3 of 6 scoring dimensions are structurally unavailable for them.
- Overlap cluster: otel-session-summary, otel-quality-reporting, and otel-output-provenance share tools and domain; their `## When` sections need clearer disambiguation.
- Pattern mismatch: cicd-cross-platform functions as injected reference context (280 keyword matches) rather than an interactive skill (6 explicit invocations) and should be documented accordingly.
- Declining trends: bug-detective, ralph-wiggum, content-creator, and feature-dev show declining usage over the last 7 days.
7. Priority Action Items
| Priority | Action | Target | Impact |
|---|---|---|---|
| P0 | Fix `otel-quality-reportinging` script-path typo | otel-quality-reporting | Runtime failures in Phase 1–2 |
| P0 | Add sensitive-file guardrail (never stage .env/credentials) | git-commit-smart | Security risk from git add -A |
| P1 | Add model to all 15 skills | All skills | Routing reliability, cost control |
| P1 | Add allowed-tools to 7 skills missing it | feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer | Tool scope control |
| P1 | Investigate agent-auditor 100% error rate | agent-auditor | Reliability |
| P1 | Trim agent-auditor to <200 lines | agent-auditor | Self-consistency |
| P2 | Investigate zero telemetry for otel-improvement, otel-output-provenance | Invocation mechanism | Observability gap |
| P2 | Add ## When sections to skills missing them | session-backlog, frontend-design, git-commit-smart | Auto-routing accuracy |
| P2 | Add ## When routing to all 4 skill-embedded agents | feature-dev/agents/*, otel-improvement/agents/* | Routing reliability |
| P3 | Add role statements (“You are…”) to all skills | All skills | Prompt engineering consistency |
| P3 | Add numbered workflow steps to ralph-wiggum | ralph-wiggum | Operational clarity |
| P3 | Disambiguate otel-session-summary vs otel-quality-reporting in ## When | Both skills | Reduce routing confusion |
8. A-Grade Agents
code-reviewer — A (50/60)
- Strengths: Highest usage (322 invocations in Feb across 20 active days), clean read-only tool set, stable week-over-week trend, near-zero error rate (4%).
- Issues: Description too terse for reliable auto-routing (“Expert code reviewer. Use proactively after code changes.” lacks any TypeScript/React/Node.js signal); category split between `review` and `code`.
- Action: Expand the description to include stack keywords.
auto-error-resolver — A (49/60)
- Strengths: Perfect definition quality (10/10) and prompt engineering (10/10), 100% category alignment (`error-handling`), rising usage (4x week-over-week).
- Issues: 7.1% error rate falls in the 5–15% penalty band (3 errors on Feb 17); session diversity is borderline at ~3 sessions.
- Action: Investigate the Feb 17 error cluster.
9. B-Grade Agents
genai-quality-monitor — B (47/60)
- Strengths: Rising usage (2x week-over-week), zero errors, domain-specific evaluation tables, Analysis Commands section with real bash queries.
- Issues: Categorized as `general` 67% of the time instead of `observability`, leaving it 1 point short of an A grade.
- Action: Add “OTEL” or “observability” to the description field for a stronger categorizer signal.
webscraping-research-analyst — B (46/60)
- Strengths: Best output density (14,728 bytes), excellent session diversity (7 sessions), perfect category alignment.
- Issues: 25% error rate on instrumented spans (a Feb 17 error with `has_actions:true`); declining trend.
- Action: Investigate the Feb 17 error (session `40dc9c66`); check whether the agent crossed into implementation despite its guardrails.
code-simplifier — B (45/60)
- Strengths: Perfect definition and prompting (10/10 each), zero errors, well-scoped Simplification Patterns table.
- Issues: Near-zero usage (1 confirmed invocation in Jan); high tool Jaccard (0.857) with the lazy `code-refactor-agent`.
- Action: Update the description to include routing keywords like “cleanup” or “readability.”
10. C-Grade Agent
agent-auditor — C (29/60)
- Critical: 100% error rate on all 5 observed invocations; 307 lines exceeds the 200-line guideline; miscategorized as `security` instead of `observability`; zero usage before Feb 21.
- Strengths: Exceptional prompt engineering (9/10) with phases, tables, guardrails, and output templates; appropriate background usage.
- Action: Trim to <200 lines; fix category routing; investigate error root cause.
11. B-Grade Skills
content-creator — B (47/60)
- Strengths: Highest telemetry among skills (114 spans), comprehensive frontmatter with MCP tools, excellent voice/tone templates.
- Issues: Missing `model`; declining trend (4 spans in the last 7 days vs 46 prior).
- Action: Add `model: claude-sonnet-4-6`; investigate the usage decline.
otel-session-summary — B (46/60)
- Strengths: Tight 3-tool footprint (Bash, Read, Glob), leanest complete skill, rising usage, clear scope boundaries.
- Issues: Missing `model`; overlaps with otel-quality-reporting on LLM-as-Judge metrics.
- Action: Add `model: claude-haiku-4-5`; add a disambiguation note to the `## When` section.
session-backlog — B (44/60)
- Strengths: Excellent Rules section, clear deduplication logic, well-specified output format.
- Issues: Missing `model`; missing a `## When` routing section; no role statement.
- Action: Add `model` and a `## When to Use` section.
otel-quality-reporting — B (43/60)
- P0: Critical typo in script paths (`otel-quality-reportinging`) causes a runtime failure.
- Strengths: Most complete output specification, quality checklist for self-verification, well-defined publish pipeline.
- Issues: Description is 274 chars (over the 200-char limit); missing `model`; overlaps with otel-session-summary.
- Action: Fix the script-path typo immediately; trim the description; add `model: claude-sonnet-4-6`.
backlog-implementer — B (41/60)
- Strengths: Best workflow detail (8 sub-steps in Phase 2), unique memory compaction protocol, code review integration per commit.
- Issues: Missing `allowed-tools` and `model`; description slightly over 200 chars.
- Action: Add `allowed-tools` and `model: claude-sonnet-4-6`; trim the description.
bug-detective — B (41/60)
- Strengths: Best DO/DON’T examples, P0–P3 priority matrix, time allocation table, Quick Start entry point.
- Issues: Missing `allowed-tools`, `model`, and an output section; declining usage; 277 lines is near the upper bound.
- Action: Add the missing frontmatter fields; promote “When to use” to a `## When to Use` heading.
git-commit-smart — B (39/60)
- P0: No guardrail against staging secrets with `git add -A`; only 28 lines (below the 30-line minimum).
- Strengths: Most-used skill (tied at 18 invocations), minimal tool list, clear commit format spec.
- Action: Add a sensitive-file guardrail; expand to 30+ lines; add `model` and a constraints section.
frontend-design — B (38/60)
- Strengths: High usage (18 invocations, rising), excellent guardrails against generic AI aesthetics.
- Issues: Missing `allowed-tools`, `model`, and `## When`; description is 215 chars (over the limit); no workflow phases.
- Action: Add a light workflow (2–3 phases); add `## When to Use`; trim the description.
ralph-wiggum — B (37/60)
- Strengths: Unique Stop-hook feedback loop pattern; clear Good for/Not good for guidance.
- Issues: Lowest prompt engineering among B-grade (5/10); no role statement, workflow steps, tables, or output spec.
- Action: Add a role statement, numbered iteration steps, an output format, `allowed-tools`, and `model`.
cicd-cross-platform — B (36/60)
- Strengths: High-quality code examples; recommendation matrix; wide keyword coverage (280 ambient matches).
- Issues: A reference document rather than a workflow skill; no role statement, workflow, `allowed-tools`, `model`, or output section.
- Action: Add a brief workflow section or document it as a reference-injection skill; add `model: claude-haiku-4-5`.
12. C-Grade Skills
otel-improvement — C (35/60)
- Strengths: Best-in-class definition (10/10) and prompting (10/10); closed-loop design with state file; grade calculation tables.
- Issues: Zero telemetry invocations (possible invocation-mechanism mismatch); missing `model`.
- Action: Investigate why `plugin.name: otel-improvement` never appears in traces; add `model`.
codebase-analyzer — C (35/60)
- Strengths: Good `## When to Use` with concrete scenarios; Mermaid diagram generation; safety guidelines.
- Issues: Missing `allowed-tools` and `model`; no role statement; the hardcoded `code-refactor-agent` subagent may not exist; `rm -f` in Phase 4 runs without confirmation.
- Action: Add the frontmatter fields; verify the subagent name; strengthen the Phase 4 guardrails.
feature-dev — C (34/60)
- Strengths: Logical 7-phase structure; user confirmation gates; “understand before acting” principle.
- Issues: Missing `allowed-tools`, `model`, `## When`, output format, examples, and tables.
- Action: Add all missing frontmatter fields and structural sections.
agent-improvement — C (33/60)
- Strengths: Best-structured prompt among new skills; comprehensive error handling; EMA state persistence.
- Issues: Brand new (zero telemetry, untracked in git); missing `model`; no role statement.
- Action: Add `model: claude-sonnet-4-6` and a role statement; monitor after first use.
otel-output-provenance — C (31/60)
- Strengths: Excellent definition completeness (9/10); unique multi-session lineage tracking; threshold mapping table.
- Issues: Zero telemetry; missing `model`; significant functional overlap with otel-quality-reporting; 201 lines (1 over the limit).
- Action: Investigate the zero invocations; move the differentiation statement into the description; add `model`.
13. Appendix: Telemetry Data Sources
- Trace files: `~/.claude/telemetry/traces-2026-02-*.jsonl` (21 files, Feb 1–21)
- Span types: `hook:agent-post-tool`, `hook:agent-pre-tool`, `hook:plugin-pre-tool`, `hook:plugin-post-tool`, `hook:skill-activation-prompt`
- Extraction: No Bash pipelines were used for telemetry extraction; all queries ran via Grep/Read on the JSONL files
- Error rate: Derived from the `agent.has_error` / `plugin.has_error` span attributes
- Usage alignment: From `agent.category` / `plugin.category` in pre-tool spans
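The error-rate derivation above can be sketched as a single pass over the JSONL traces. The exact span schema (a top-level `attributes` object holding flat `agent.name` and `agent.has_error` keys) is an assumption based on the attribute names listed here:

```python
import json
from collections import Counter
from pathlib import Path

def error_rates(trace_dir="~/.claude/telemetry"):
    """Compute per-agent error rates from JSONL trace files.

    Assumes one span per line; spans without an `agent.name`
    attribute are skipped.
    """
    totals, errors = Counter(), Counter()
    for path in sorted(Path(trace_dir).expanduser().glob("traces-2026-02-*.jsonl")):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue  # tolerate blank lines between spans
            attrs = json.loads(line).get("attributes", {})
            name = attrs.get("agent.name")
            if name is None:
                continue
            totals[name] += 1
            if attrs.get("agent.has_error"):
                errors[name] += 1
    return {name: errors[name] / totals[name] for name in totals}
```

The same shape works for the plugin-side attributes by swapping in `plugin.name` / `plugin.has_error`.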