Agent & Skill Quality Audit

Consolidated Scorecard & Findings
Date: February 20, 2026 | Audited: 25 definitions (6 agents, 4 skill-embedded, 15 skills) | Method: agent-auditor rubric (6 dimensions, max 60)

Executive Summary

This audit scores 25 agent and skill definitions across 6 dimensions using the agent-auditor rubric. Of the 21 definitions scored on the full /60 scale, two earned A grades (code-reviewer, auto-error-resolver), thirteen scored B, six scored C, and none scored D; the four skill-embedded agents were graded on an adjusted /30 scale. Four critical issues were identified: a runtime-breaking typo in otel-quality-reporting, a security gap in git-commit-smart, a 100% error rate on agent-auditor, and zero telemetry for two OTEL skills.

Grade scale: A = 48–60 | B = 36–47 | C = 24–35 | D = <24

1. Active Agents Scorecard

| Agent | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| code-reviewer | 9 | 9 | 9 | 8 | 7 | 8 | 50 | A |
| auto-error-resolver | 7 | 10 | 10 | 8 | 10 | 6 | 49 | A |
| genai-quality-monitor | 9 | 10 | 10 | 5 | 4 | 9 | 47 | B |
| webscraping-research-analyst | 5 | 10 | 10 | 5 | 9 | 7 | 46 | B |
| code-simplifier | 4 | 10 | 10 | 5 | 8 | 8 | 45 | B |
| agent-auditor | 1 | 8 | 9 | 9 | 0 | 2 | 29 | C |

2. Skill-Embedded Agents

Note: Skill-embedded agents score 0/10 on Telemetry, Alignment, and Efficiency (no separate spans emitted). Adjusted grade uses deterministic dimensions only (/30 scale: A-eq = 24–30, B-eq = 18–23, C-eq = 12–17).
| Agent | Definition | Prompting | Overlap | Det /30 | Adj Grade |
|---|---|---|---|---|---|
| html-auditor | 9 | 8 | 5 | 22 | A-eq |
| code-reviewer (skill) | 9 | 5.5 | 5 | 19.5 | B-eq |
| code-explorer (skill) | 9 | 5 | 5 | 19 | B-eq |
| code-architect (skill) | 9 | 5 | 5 | 19 | B-eq |

3. Skills Scorecard

| Skill | Telemetry | Definition | Prompting | Overlap | Alignment | Efficiency | Total | Grade |
|---|---|---|---|---|---|---|---|---|
| content-creator | 8 | 9 | 8 | 8 | 7 | 7 | 47 | B |
| otel-session-summary | 7 | 9 | 9 | 6 | 7 | 8 | 46 | B |
| session-backlog | 4 | 8 | 9 | 8 | 7 | 8 | 44 | B |
| otel-quality-reporting | 8 | 8 | 9 | 5 | 8 | 5 | 43 | B |
| backlog-implementer | 4 | 7 | 9 | 7 | 7 | 7 | 41 | B |
| bug-detective | 7 | 6 | 8 | 7 | 7 | 6 | 41 | B |
| git-commit-smart | 7 | 5 | 4 | 8 | 8 | 7 | 39 | B |
| frontend-design | 7 | 4 | 6 | 8 | 8 | 5 | 38 | B |
| ralph-wiggum | 8 | 5 | 5 | 8 | 6 | 5 | 37 | B |
| cicd-cross-platform | 6 | 6 | 6 | 8 | 5 | 5 | 36 | B |
| otel-improvement | 0 | 10 | 10 | 7 | 0 | 8 | 35 | C |
| codebase-analyzer | 5 | 5 | 6 | 7 | 6 | 6 | 35 | C |
| feature-dev | 6 | 5 | 6 | 6 | 5 | 6 | 34 | C |
| agent-improvement | 1 | 9 | 8 | 7 | 1 | 7 | 33 | C |
| otel-output-provenance | 0 | 9 | 9 | 6 | 0 | 7 | 31 | C |

4. Grade Distribution

| Grade | Count | Definitions |
|---|---|---|
| A | 2 | code-reviewer (50), auto-error-resolver (49) |
| B | 13 | genai-quality-monitor (47), content-creator (47), webscraping-research-analyst (46), otel-session-summary (46), code-simplifier (45), session-backlog (44), otel-quality-reporting (43), backlog-implementer (41), bug-detective (41), git-commit-smart (39), frontend-design (38), ralph-wiggum (37), cicd-cross-platform (36) |
| C | 6 | otel-improvement (35), codebase-analyzer (35), feature-dev (34), agent-improvement (33), otel-output-provenance (31), agent-auditor (29) |
| D | 0 | — |

Skill-Embedded Agents (Adjusted /30 Scale)

| Adj Grade | Count | Definitions |
|---|---|---|
| A-eq | 1 | html-auditor (22/30) |
| B-eq | 3 | code-reviewer-skill (19.5/30), code-explorer (19/30), code-architect (19/30) |

5. Scoring Dimensions Reference

| # | Dimension | Max | Method |
|---|---|---|---|
| 1 | Telemetry Health | 10 | Usage frequency (35%), error rate (25%), trend (20%), session diversity (20%) from OTEL traces |
| 2 | Definition Quality | 10 | 10-point checklist: name, description, tools, model, description length, ## sections, ## When, output format, line count 30–200, tools restricted |
| 3 | Prompt Engineering | 10 | Role statement (1), numbered steps (1.5), guardrails (1), examples (1), tables (1), output spec (1.5), markdown structure (1), scope boundaries (1) |
| 4 | Overlap & Redundancy | 10 | Jaccard similarity on tool sets and description keywords; flag if tool-Jaccard >0.8 AND keyword-Jaccard >0.5 |
| 5 | Usage Alignment | 10 | Match expected category from description vs actual agent.category / plugin.category in OTEL |
| 6 | Efficiency & Cost | 10 | Duration percentile (30%), retry/error amplification (25%), output density (25%), background appropriateness (20%) |
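Dimension 4 reduces to two Jaccard computations and an AND over the two thresholds. A minimal Python sketch (the tool and keyword sets below are hypothetical, not taken from any audited definition):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical tool sets and description keywords for two definitions.
tools_a = {"Read", "Grep", "Glob", "Bash"}
tools_b = {"Read", "Grep", "Glob", "Edit"}
keywords_a = {"review", "quality", "lint"}
keywords_b = {"refactor", "quality", "cleanup"}

tool_j = jaccard(tools_a, tools_b)            # 3 shared / 5 total = 0.6
keyword_j = jaccard(keywords_a, keywords_b)   # 1 shared / 5 total = 0.2

# Flag only when BOTH thresholds are exceeded, per the rubric.
flagged = tool_j > 0.8 and keyword_j > 0.5    # False here
```

Requiring both thresholds keeps definitions with similar tool sets but distinct vocabularies (a common, legitimate pattern) from being flagged.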

6. Cross-Cutting Findings

Universal Gaps

  • Missing model field: 0/15 skills specify a model in frontmatter — all should add one
  • Missing allowed-tools: 7 skills lack tool restrictions (feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer)
  • No role statements: 0/15 skills open with “You are…” — all use functional/descriptive openings
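All three gaps live in, or immediately after, the skill frontmatter. A minimal sketch of a compliant header, assuming the frontmatter fields this report references (model, allowed-tools); the skill name, description, and tool list are illustrative only:

```yaml
---
name: example-skill                 # illustrative, not one of the audited skills
description: Summarize OTEL session traces into a concise quality report. Use when the user asks for a session summary.
model: claude-sonnet-4-6            # closes the missing-model gap
allowed-tools: Read, Grep, Bash     # closes the missing allowed-tools gap
---

You are a telemetry analyst who distills session traces into concise reports.
```

The "You are…" opening line after the frontmatter addresses the missing-role-statement gap.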

Critical Issues

P0 Runtime failure: otel-quality-reporting has a typo in Phase 1–2 script paths (otel-quality-reportinging — double “ing”) that causes FileNotFoundError.
P0 Security risk: git-commit-smart uses git add -A with no guardrail against staging .env, credentials, or secrets.
P1 100% error rate: agent-auditor has 100% error rate on all 5 observed spans (Feb 21); 307 lines exceeds its own 200-line guideline.
P1 Zero telemetry: otel-improvement and otel-output-provenance have 0 plugin invocations despite references in traces — likely invocation mechanism mismatch.
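For the git-commit-smart gap, one shape a sensitive-file guardrail could take, sketched in Python for illustration (the deny-list patterns are illustrative and not exhaustive; the skill itself would express this as a prompt guardrail rather than code):

```python
import fnmatch
import subprocess

# Illustrative deny-list; a real guardrail should be reviewed and extended.
SENSITIVE_PATTERNS = [".env", ".env.*", "*.pem", "*.key", "*credential*", "*secret*"]

def is_sensitive(path: str) -> bool:
    """True if a path (or its basename) matches any deny-list pattern."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch.fnmatch(name, pat) or fnmatch.fnmatch(path, pat)
               for pat in SENSITIVE_PATTERNS)

def safe_add_all() -> None:
    """Run `git add -A` only after screening the worktree for sensitive files."""
    status = subprocess.run(["git", "status", "--porcelain"],
                            capture_output=True, text=True, check=True).stdout
    # Porcelain lines are "XY path"; the path starts at column 3.
    hits = [line[3:].strip() for line in status.splitlines()
            if line and is_sensitive(line[3:].strip())]
    if hits:
        raise SystemExit(f"Refusing to stage potentially sensitive files: {hits}")
    subprocess.run(["git", "add", "-A"], check=True)
```

Screening before staging, rather than after, means a secret never enters the index in the first place.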

Structural Observations

  • Telemetry blind spot: Skill-embedded agents (feature-dev/agents/*, otel-improvement/agents/*) emit no separate spans; 3/6 scoring dimensions are structurally unavailable for them.
  • Overlap cluster: otel-session-summary / otel-quality-reporting / otel-output-provenance share tools and domain; ## When sections need clearer disambiguation.
  • Pattern mismatch: cicd-cross-platform functions as injected reference context (280 keyword matches) rather than an interactive skill (6 explicit invocations) — should be documented accordingly.
  • Declining trends: bug-detective, ralph-wiggum, content-creator, and feature-dev show declining usage in last 7 days.

7. Priority Action Items

| Priority | Action | Target | Impact |
|---|---|---|---|
| P0 | Fix script path typo (otel-quality-reportinging) | otel-quality-reporting | Runtime failures in Phase 1–2 |
| P0 | Add sensitive-file guardrail (never stage .env/credentials) | git-commit-smart | Security risk from git add -A |
| P1 | Add model to all 15 skills | All skills | Routing reliability, cost control |
| P1 | Add allowed-tools to 7 skills missing it | feature-dev, bug-detective, cicd-cross-platform, ralph-wiggum, codebase-analyzer, frontend-design, backlog-implementer | Tool scope control |
| P1 | Investigate agent-auditor 100% error rate | agent-auditor | Reliability |
| P1 | Trim agent-auditor to <200 lines | agent-auditor | Self-consistency |
| P2 | Investigate zero telemetry for otel-improvement, otel-output-provenance | Invocation mechanism | Observability gap |
| P2 | Add ## When sections to skills missing them | session-backlog, frontend-design, git-commit-smart | Auto-routing accuracy |
| P2 | Add ## When routing to all 4 skill-embedded agents | feature-dev/agents/*, otel-improvement/agents/* | Routing reliability |
| P3 | Add role statements (“You are…”) to all skills | All skills | Prompt engineering consistency |
| P3 | Add numbered workflow steps to ralph-wiggum | ralph-wiggum | Operational clarity |
| P3 | Disambiguate otel-session-summary vs otel-quality-reporting in ## When | Both skills | Reduce routing confusion |

8. A-Grade Agents

code-reviewer — A (50/60)

  • Strengths: Highest usage (322 invocations in February across 20 active days), clean read-only tool set, stable week-over-week trend, low error rate (4%).
  • Issues: Description too terse for reliable auto-routing (“Expert code reviewer. Use proactively after code changes.” — lacks TypeScript/React/Node.js signal); category split between review and code.
  • Action: Expand description to include stack keywords.

auto-error-resolver — A (49/60)

  • Strengths: Perfect definition quality (10/10) and prompt engineering (10/10), 100% category alignment (error-handling), rising usage (4x week-over-week).
  • Issues: 7.1% error rate in the 5–15% penalty band (3 errors on Feb 17); session diversity borderline at ~3 sessions.
  • Action: Investigate Feb 17 error cluster.

9. B-Grade Agents

genai-quality-monitor — B (47/60)

  • Strengths: Rising usage (2x week-over-week), zero errors, domain-specific evaluation tables, Analysis Commands section with real bash queries.
  • Issues: Categorized as general 67% of the time instead of observability — 1 point from A-grade.
  • Action: Add “OTEL” or “observability” to description field for stronger categorizer signal.

webscraping-research-analyst — B (46/60)

  • Strengths: Best output density (14,728 bytes), excellent session diversity (7 sessions), perfect category alignment.
  • Issues: 25% error rate on instrumented spans (Feb 17 error with has_actions:true); declining trend.
  • Action: Investigate Feb 17 error (session 40dc9c66); check if agent crossed into implementation despite guardrails.

code-simplifier — B (45/60)

  • Strengths: Perfect definition and prompting (10/10 each), zero errors, well-scoped Simplification Patterns table.
  • Issues: Near-zero usage (1 confirmed invocation in Jan); high tool Jaccard (0.857) with the legacy code-refactor-agent.
  • Action: Update description to include routing keywords like “cleanup” or “readability.”

10. C-Grade Agent

agent-auditor — C (29/60)

Critical: 100% error rate on all 5 observed invocations; 307 lines exceeds 200-line guideline; miscategorized as security instead of observability; zero usage before Feb 21.
  • Strengths: Exceptional prompt engineering (9/10) with phases, tables, guardrails, and output templates; appropriate background usage.
  • Action: Trim to <200 lines; fix category routing; investigate error root cause.

11. B-Grade Skills

content-creator — B (47/60)

  • Strengths: Highest telemetry among skills (114 spans), comprehensive frontmatter with MCP tools, excellent voice/tone templates.
  • Issues: Missing model; declining trend (4 spans last 7d vs 46 prior).
  • Action: Add model: claude-sonnet-4-6; investigate usage decline.

otel-session-summary — B (46/60)

  • Strengths: Tight 3-tool footprint (Bash, Read, Glob), leanest complete skill, rising usage, clear scope boundaries.
  • Issues: Missing model; overlap with otel-quality-reporting on LLM-as-Judge metrics.
  • Action: Add model: claude-haiku-4-5; add disambiguation note to ## When section.

session-backlog — B (44/60)

  • Strengths: Excellent Rules section, clear deduplication logic, well-specified output format.
  • Issues: Missing model; missing ## When routing section; no role statement.
  • Action: Add model and ## When to Use.

otel-quality-reporting — B (43/60)

P0 Critical typo in script paths (otel-quality-reportinging) causes runtime failure.
  • Strengths: Most complete output specification, quality checklist for self-verification, well-defined publish pipeline.
  • Issues: Description 274 chars (over 200 limit); missing model; overlap with otel-session-summary.
  • Action: Fix script path typo immediately; trim description; add model: claude-sonnet-4-6.

backlog-implementer — B (41/60)

  • Strengths: Best workflow detail (8 sub-steps in Phase 2), unique memory compaction protocol, code review integration per commit.
  • Issues: Missing allowed-tools and model; description slightly over 200 chars.
  • Action: Add allowed-tools and model: claude-sonnet-4-6; trim description.

bug-detective — B (41/60)

  • Strengths: Best DO/DON’T examples, P0–P3 priority matrix, time allocation table, Quick Start entry point.
  • Issues: Missing allowed-tools, model, output section; declining usage; 277 lines near upper bound.
  • Action: Add frontmatter fields; promote “When to use” to ## When to Use heading.

git-commit-smart — B (39/60)

P0 No guardrail against staging secrets with git add -A. Only 28 lines (below 30-line minimum).
  • Strengths: Most-used skill (tied at 18 invocations), minimal tool list, clear commit format spec.
  • Action: Add sensitive-file guardrail; expand to 30+ lines; add model and constraints section.

frontend-design — B (38/60)

  • Strengths: High usage (18 invocations, rising), excellent guardrails against generic AI aesthetics.
  • Issues: Missing allowed-tools, model, ## When; description 215 chars (over limit); no workflow phases.
  • Action: Add light workflow (2–3 phases); add ## When to Use; trim description.

ralph-wiggum — B (37/60)

  • Strengths: Unique Stop-hook feedback loop pattern; clear Good for/Not good for guidance.
  • Issues: Lowest prompt engineering among B-grade (5/10); no role statement, workflow steps, tables, or output spec.
  • Action: Add role statement, numbered iteration steps, output format, allowed-tools, and model.

cicd-cross-platform — B (36/60)

  • Strengths: High-quality code examples; recommendation matrix; wide keyword coverage (280 ambient matches).
  • Issues: Reference document, not workflow skill; no role statement, workflow, allowed-tools, model, or output section.
  • Action: Add brief workflow section or document as reference-injection skill; add model: claude-haiku-4-5.

12. C-Grade Skills

otel-improvement — C (35/60)

  • Strengths: Best-in-class definition (10/10) and prompting (10/10); closed-loop design with state file; grade calculation tables.
  • Issues: Zero telemetry invocations — possible invocation mechanism mismatch; missing model.
  • Action: Investigate why plugin.name: otel-improvement never appears in traces; add model.

codebase-analyzer — C (35/60)

  • Strengths: Good ## When to Use with concrete scenarios; Mermaid diagram generation; safety guidelines.
  • Issues: Missing allowed-tools and model; no role statement; hardcoded code-refactor-agent subagent may not exist; rm -f in Phase 4 without confirmation.
  • Action: Add frontmatter fields; verify subagent name; strengthen Phase 4 guardrails.

feature-dev — C (34/60)

  • Strengths: Logical 7-phase structure; user confirmation gates; “understand before acting” principle.
  • Issues: Missing allowed-tools, model, ## When, output format, examples, tables.
  • Action: Add all missing frontmatter fields and structural sections.

agent-improvement — C (33/60)

  • Strengths: Best-structured prompt among new skills; comprehensive error handling; EMA state persistence.
  • Issues: Brand new (zero telemetry, untracked in git); missing model; no role statement.
  • Action: Add model: claude-sonnet-4-6 and role statement; monitor after first use.

otel-output-provenance — C (31/60)

  • Strengths: Excellent definition completeness (9/10); unique multi-session lineage tracking; threshold mapping table.
  • Issues: Zero telemetry; missing model; significant functional overlap with otel-quality-reporting; 201 lines (1 over limit).
  • Action: Investigate zero invocations; move differentiation statement into description; add model.

13. Appendix: Telemetry Data Sources

  • Trace files: ~/.claude/telemetry/traces-2026-02-*.jsonl (21 files, Feb 1–21)
  • Span types: hook:agent-post-tool, hook:agent-pre-tool, hook:plugin-pre-tool, hook:plugin-post-tool, hook:skill-activation-prompt
  • Extraction: All queries performed via Grep/Read directly on the JSONL files; no Bash pipelines were used
  • Error rate: Derived from agent.has_error / plugin.has_error span attributes
  • Usage alignment: From agent.category / plugin.category in pre-tool spans
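The error-rate derivation above amounts to a single pass over the trace files. A minimal Python illustration, assuming each JSONL line is a span object carrying an attributes map with agent.name and agent.has_error (the exact span schema is an assumption):

```python
import json
from collections import defaultdict
from pathlib import Path

def error_rates(trace_dir: str) -> dict[str, float]:
    """Per-agent error rate from agent.has_error attributes in JSONL traces."""
    totals: dict = defaultdict(lambda: [0, 0])  # agent name -> [errors, spans]
    for path in Path(trace_dir).glob("traces-2026-02-*.jsonl"):
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            attrs = json.loads(line).get("attributes", {})
            name = attrs.get("agent.name")
            if name is None:          # plugin/hook spans without an agent name
                continue
            counts = totals[name]
            counts[1] += 1
            if attrs.get("agent.has_error"):
                counts[0] += 1
    return {name: errs / total for name, (errs, total) in totals.items()}
```

The same pass with plugin.name / plugin.has_error would cover the skills scorecard.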