developer
GSkillPacks
Public catalog of reusable agentic workflows and skill packs for Claude Code and Codex agents — structured prompt-to-ship pipelines with inspectable sessions, pack mapping, and proof-of-work documentation.
www.gskillpacks.com/activedeveloper
Key metrics
Metrics being defined
Competitive Intel
0 entries
No competitive intel entries for this product yet.
Research Hub
7 types
- Devtool Adoption
- Devtool Docs Audit
- Devtool Dx Journey
- Devtool Integration Map
- Devtool Monetization
- Devtool Positioning
- Devtool User Map
Roadmap
1217 items
Done: 1180
`--dry-run` flag on kanban skills shows intended card operations without DB writes
`/admin/newsletter` requires the configured admin secret.
`/follow` submits valid email addresses through a first-party tRPC mutation.
`/reconcile-dev-docs fix tasks` - Resolved orphaned Phase 38 manual tasks: 4 items deferred to future work (Neon DB, admin secret, Vercel env vars, live verification).
`/skills` command lists skills grouped by workflow stage with keyword search
`$benchmark-test-skill <skill>` reports custom/generic/blocked coverage status.
`$benchmark-test-skill <skill>` reports hard pass rate separately from quality score.
`$run`, `$ship`, `$ship-end`, and `$commit-and-push-by-feature` require a ship manifest for non-trivial source changes.
`analyze-sessions` no longer owns targeted skill retrospective or single-incident triage behavior.
`benchmark/review-analyze-sessions-2026-05-15.md` records scores, findings, remediation, and next route.
`benchmark/review-benchmark-test-skill-2026-05-13.md` records scores, findings, remediation, and next route.
`benchmark/review-content-programming-2026-05-14.md` records scores, findings, remediation, and next route.
`benchmark/review-icon-handler-2026-05-14.md` records scores, findings, remediation, and next route.
`benchmark/review-session-triage-2026-05-13.md` records scores, findings, remediation, and next route.
`benchmark/review-session-triage-2026-05-13.md` records source reports, run indexes, scores, findings, remediation, and next route.
`benchmark/test-analyze-sessions-2026-05-15.md` records fresh verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-benchmark-test-skill-2026-05-12.md` records fresh verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-benchmark-test-skill-2026-05-12.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-benchmark-test-skill-2026-05-13.md` records fresh verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-content-programming-2026-05-14.md` records fresh verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-design-system-2026-05-10.md` reflects the latest `report.json` metrics, failures, latency, cost, consistency, and raw session path.
`benchmark/test-icon-handler-2026-05-13.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-icon-handler-2026-05-14.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-run-2026-05-11.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-session-triage-2026-05-13.md` records fresh verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-session-triage-2026-05-13.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-ship-2026-05-11.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/test-spec-interview-2026-05-12.md` records verify, benchmark, latency, cost, consistency, and raw session evidence.
`benchmark/triage-analyze-sessions-2026-05-15.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-benchmark-test-skill-2026-05-13.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-content-programming-2026-05-14-quality.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-content-programming-2026-05-14.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-icon-handler-2026-05-13.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-icon-handler-2026-05-14-image.md` records verdict, root cause, responsible gap, validation plan, and next route.
`benchmark/triage-session-triage-2026-05-13.md` records verdict, root cause, responsible gap, validation plan, and next route.
`cmdArchiveCard` handles orphaned list/board references gracefully
`commit-and-push-by-feature` has a safe fixture plan using an explicit-permission disposable GitHub test repository.
`content-programming` hard assertions and quality scoring accept Claude `/series-spec` and Codex `$series-spec`.
`create-board --template standard` creates a board with all 5 required lists, each of the correct type
`creator-evidence-schema` exists for both Claude and Codex.
`creator-media` no longer installs those Remotion production skills by default.
`creator-platform-capability-matrix` exists for both Claude and Codex.
`creator-presence-dossier` exists for both Claude and Codex.
`docs/skills-reference.md` output paths match actual skill behavior
`global/claude/analyze-sessions/SKILL.md` documents the same targeted retrospective workflow.
`global/claude/feature-interview/SKILL.md` mirrors the same workflow with Claude command prefixes.
`global/claude/pack` and `global/codex/pack` describe mixed-monorepo routing.
`global/codex/analyze-sessions/SKILL.md` documents targeted skill-performance retrospectives.
`global/codex/codebase-status/SKILL.md` and `global/claude/codebase-status/SKILL.md` exist with versioned frontmatter.
`global/codex/feature-interview/SKILL.md` and `global/claude/feature-interview/SKILL.md` exist with versioned frontmatter.
`global/codex/feature-interview/SKILL.md` requires evidence-backed claim validation and technical gotcha discovery before deep user interrogation.
`global/codex/install-agentic-skills/SKILL.md` and `global/claude/install-agentic-skills/SKILL.md` exist with versioned frontmatter.
`global/codex/targeted-skill-builder/SKILL.md` and `global/claude/targeted-skill-builder/SKILL.md` exist with versioned frontmatter.
`kanban.mjs` is documented as fallback/admin-only or removed from the default workflow
`mono-detect` correctly identifies pnpm workspaces and Turborepo, outputs `.agents/monorepo.json` with package list and dependency graph.
`mono-guard` validates lane-spec disjointness pre-flight and boundary compliance post-integration.
`mono-run` generates lane specs from roadmap execution profiles, runs `/mono-guard` pre-flight, dispatches parallel worktree agents for package-scoped steps, and runs cross-cutting steps serially.
`mono-run` stops all lanes on any lane failure and preserves worktree state.
`mono-ship` runs package-scoped and transitive-dependent tests/lint/build before delegating to `/ship`.
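The lane-spec disjointness check that `mono-guard` performs before dispatch can be pictured with a small sketch. The roadmap does not show the actual lane-spec format, so the `LaneSpec` shape and the `findLaneOverlaps` helper below are hypothetical names used only for illustration: a lane is assumed to list the package paths it may write to, and the check reports any package claimed by more than one lane.

```typescript
// Hypothetical lane spec: a lane name plus the package paths it may write to.
// The real .agents lane-spec format is not shown in the roadmap.
interface LaneSpec {
  name: string;
  packages: string[];
}

// Return every package claimed by more than one lane, mapped to the claiming lanes.
// An empty map means the lanes are pairwise disjoint and can run in parallel.
function findLaneOverlaps(lanes: LaneSpec[]): Map<string, string[]> {
  const owners = new Map<string, string[]>();
  for (const lane of lanes) {
    for (const pkg of lane.packages) {
      const names = owners.get(pkg) ?? [];
      // Ignore a package listed twice inside the same lane.
      if (names.indexOf(lane.name) === -1) names.push(lane.name);
      owners.set(pkg, names);
    }
  }
  const overlaps = new Map<string, string[]>();
  owners.forEach((names, pkg) => {
    if (names.length > 1) overlaps.set(pkg, names);
  });
  return overlaps;
}
```

Under these assumptions, a pre-flight step would refuse to dispatch parallel worktree agents whenever the returned map is non-empty.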
`pack-fixture-evidence` requires concrete fixture paths and input facts.
`packs/remotion` exists with mirrored Claude/Codex `youtube-format-research`, `video-script`, and `video-build` skill contracts.
`pnpm --dir tests test:live` runs the live layer3 suite when `LIVE_AGENT_TESTS=1` is set.
`pnpm --dir tests test` still runs only fast layer1 tests by default.
`pnpm bench --list-skills` confirms `analyze-sessions` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `benchmark-test-skill` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `content-programming` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `icon-handler` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `run` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `session-triage` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `ship` is known and reports its coverage status.
`pnpm bench --list-skills` confirms `spec-interview` is known and reports its coverage status.
`pnpm bench --list-skills` lists repository skills, not only custom layer4 targets.
`pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill benchmark-test-skill --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill content-programming --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill design-system --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill icon-handler --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill run --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill run --agent codex --runs 1 --chunk-size 1 --pause 0` completes through the generic benchmark path.
`pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill ship --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill ship --agent codex --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench --skill spec-interview --agent both --runs 3 --chunk-size 3 --pause 0` runs only after verify passes.
`pnpm bench` accepts `--agent claude`, `--agent codex`, and `--agent both`, defaulting to both.
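The `--agent` behaviour described above (three accepted values, defaulting to both) could be resolved along these lines. This is a sketch only: `resolveAgents` and the `Agent` type are hypothetical names, and the real `pnpm bench` argument parser is not shown in the roadmap.

```typescript
type Agent = "claude" | "codex";

// Hypothetical helper: map `--agent <value>` to the list of agents to benchmark.
// When the flag is absent, fall back to both agents.
function resolveAgents(argv: string[]): Agent[] {
  const i = argv.indexOf("--agent");
  const value = i === -1 ? "both" : argv[i + 1];
  switch (value) {
    case "claude":
      return ["claude"];
    case "codex":
      return ["codex"];
    case "both":
      return ["claude", "codex"];
    default:
      // Covers unknown values and a trailing `--agent` with no value.
      throw new Error(`unknown --agent value: ${String(value)}`);
  }
}
```

Rejecting unknown values, instead of silently falling back to the default, keeps a mistyped flag from launching an unintended run against both agents.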
`pnpm verify --skill analyze-sessions` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill benchmark-test-skill` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill content-programming` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill design-system` passes from `tests/`.
`pnpm verify --skill icon-handler` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill run` is attempted from `tests/`.
`pnpm verify --skill run` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill run` passes with layer2 skipped when no target-specific layer2 test exists.
`pnpm verify --skill session-triage` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill ship` passes or blocks benchmark execution with a recorded failure.
`pnpm verify --skill spec-interview` passes or blocks benchmark execution with a recorded failure.
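The verify items above share one rule: verify either passes, or it blocks the benchmark and leaves a recorded failure. A minimal sketch of that gate follows, assuming verify and bench can be modelled as plain callbacks. `VerifyResult`, `GateOutcome`, and `runGatedBenchmark` are hypothetical names, not the project's API.

```typescript
// Hypothetical result of a verify step: success, or a failure message to record.
type VerifyResult = { ok: true } | { ok: false; failure: string };

interface GateOutcome {
  skill: string;
  benchmarkRan: boolean;
  recordedFailure?: string;
}

// Run the benchmark only when verify passes; otherwise skip it and keep the failure.
function runGatedBenchmark(
  skill: string,
  verify: (skill: string) => VerifyResult,
  bench: (skill: string) => void,
): GateOutcome {
  const result = verify(skill);
  if (!result.ok) {
    return { skill, benchmarkRan: false, recordedFailure: result.failure };
  }
  bench(skill);
  return { skill, benchmarkRan: true };
}
```

Returning the failure in the outcome, instead of throwing, lets a report writer record why the benchmark was skipped.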
`poketo-kanban --archive` cleans up Done/Punt cards older than 30 days with user confirmation
`README.md` and `docs/packs.md` show a mixed devtool/business-app example.
`run-*.json` can include bounded generated artifact content when a benchmark setup declares `qualityOutputPath`.
`scripts/generate-skills-showcase-data.mjs` writes committed generated data covering every tracked source skill.
`scripts/generate-skills-showcase-github-data.mjs` writes committed proof data or an honest fallback without requiring secrets.
`scripts/pack.sh install`, `remove`, `refresh`, and `set-mode` do not drop existing `project_scopes` or `notes` when `jq` is available.
`scripts/pack.sh` writes lock owner metadata and removes stale dead-PID locks.
`scripts/validate-skills-showcase-data.sh` fails when generated showcase data is stale.
`search --board <id>` scopes results to specified board(s); repeatable flag works
`session-triage` exists for Claude and Codex with verification verdict, timeline, root cause, skill fix, validation plan, confidence, and evidence-gap outputs.
`sync` has a safe fixture plan using an explicit-permission disposable GitHub test repository.
`targeted-skill-builder` distinguishes `$session-triage` for immediate incidents from `$analyze-sessions` for broad history analysis.
`tests/layer4/setups/tier1-workflows.setup.ts` requires separate `## Benchmark Metrics` rows for pass rate, p50 latency, total cost, and raw session path with the exact evidence tokens.
`youtube-description-optimizer` exists for both Claude and Codex.
`youtube-video-audit` exists for both Claude and Codex.
A clean benchmark-results matrix lists skills with persisted evaluated benchmark data, hard pass rates, quality scores, subjective review grades when present, and raw report paths.
A committed coverage matrix lists every repository skill.
A consistent report-first approval rule is added where appropriate.
A conventional `favicon.ico` route is generated from the source asset.
A deterministic quality criterion penalizes unconditional skill-builder routing when a report says the existing contract is adequate or the issue is one-off agent noncompliance.
A reusable quality-gate contract exists and is referenced by the global mutation/shipping skills.
A validation script detects missing required ship-manifest fields and passes on a complete fixture.
Add a deterministic quality criterion that penalizes unconditional skill-builder routing when a report says the existing contract is adequate or the issue is one-off agent noncompliance.
Add a fixture prompt requirement to verify `session-triage-report.md` exists in the project root after writing and create it before responding if missing.
Add a stop rule to affected mirrored Claude/Codex skill contracts.
Add bounded generated artifact persistence to `run-*.json` results when a setup declares `qualityOutputPath`.
Add deterministic layer1 coverage for post-write artifact handoff and intent-aware routing.
Add focused layer1 coverage for bold labels, runner-specific routes, and the broadened fixture.
Add focused layer1 coverage proving generated artifact content is persisted for later review.
Add focused layer1 regression coverage using the failing Claude artifact shapes while preserving rejection of missing remediation detail.
Add focused regression coverage for Codex interview cadence.
Add hard assertions and deterministic quality criteria that distinguish full strategy output from calendar-only output.
Add layer1 coverage for the root artifact requirement, required report sections, no-skill-change branch, and explicit "task checklist, not skill contract" language.
Add layer1 coverage that fails if the root source asset regresses to non-PNG placeholder text.
Add layer1 regression coverage for a facts-present but malformed metrics section.
Add layer1 regression coverage for accepted no-skill-change routing and rejected over-remediation routing.
Add layer1 regression coverage for hard assertions and quality route scoring.
Add layer1 regression coverage for rejected `$run` evidence-gate over-remediation and accepted operational `$run` routing.
Add layer1 regression coverage for the 2026-05-14 `icon-handler` evidence path.
Add layer1 regression coverage for the bad Claude pattern: claiming `/run` is unavailable or needs a gate, then routing to `/targeted-skill-builder run`.
Add layer1 regression coverage for the observed wrong final routes and correct runner-specific final routes.
Add layer1 regression coverage for the prompt label requirement, Claude `/series-spec`, Codex `$series-spec`, and quality scoring.
Add layer1 regression coverage in `tests/layer1/runner.test.ts`.
Add layer1 regression coverage proving concrete fixture references pass and generic evidence prose still fails.
Add layer1 regression coverage that accepts `Recommended next skill: none` and rejects reintroducing the hard-coded route assertion.
Add layer1 regression coverage that accepts a structured fixture report and rejects an exact-but-unstructured evidence dump.
Add regression coverage for `content-programming` quality and review evidence.
Add runner-specific route support to the Tier 2/3 global workflow benchmark helper.
Add the Codex `agents/openai.yaml` manifest.
Admin can list, search, copy active emails, and download CSV.
Agent recommends next task based on board state and project priorities
Agent-team lane deliverables include branch, commit SHA, validation evidence, and PR URL or an explicit blocker.
Agent-team planning includes a consolidation/PR review step after parallel lanes and before final validation/shipping.
Agent-team write lanes require separate GitHub branches with deterministic names.
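The agent-team deliverable rule (branch, commit SHA, validation evidence, and either a PR URL or an explicit blocker) lends itself to a mechanical check. The sketch below assumes a flat record per lane. The `LaneDeliverable` field names and the `checkLaneDeliverable` helper are hypothetical and do not reflect the project's actual schema.

```typescript
// Hypothetical deliverable record; field names are illustrative only.
interface LaneDeliverable {
  branch?: string;
  commitSha?: string;
  validationEvidence?: string;
  prUrl?: string;
  blocker?: string;
}

// Return a list of problems; an empty list means the deliverable is complete.
function checkLaneDeliverable(d: LaneDeliverable): string[] {
  const has = (v?: string): boolean => typeof v === "string" && v.trim().length > 0;
  const problems: string[] = [];
  if (!has(d.branch)) problems.push("missing branch");
  if (!has(d.commitSha)) problems.push("missing commit SHA");
  if (!has(d.validationEvidence)) problems.push("missing validation evidence");
  // A lane may finish without a PR only if it states why.
  if (!has(d.prUrl) && !has(d.blocker)) problems.push("missing PR URL or explicit blocker");
  return problems;
}
```

A consolidation step could run this over every lane and stop before PR review if any lane reports problems.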
All Codex skills have `agents/openai.yaml`
All new and existing tests pass (53 passed, 1 todo; target was 40+)
All phase tests pass.
All tests pass across all suites (77 tests)
Approved synthesized writes require a created/updated file list, verification/readback note, git status or dirty-artifact handoff, and explicit next action.
At least 5 database error path tests (insert failure, FK violation, connection error)
At least one iteration of versioning scheme documented and applied to 3+ skills
Audit current Skills Showcase routes and identify the surfaces still using legacy cards, rows, and summary grids.
Audit report-first approval gates across packs and identify skills with the same risk.
Audit the app framework, source asset dimensions, existing icon surfaces, and stale assets.
Backslash LIKE escape bug fixed with regression test
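The roadmap does not describe the original bug, but a common way to make user input match literally in a SQL `LIKE` pattern is to escape the backslash itself along with the `%` and `_` wildcards. The sketch below assumes the query declares backslash as its escape character (for example with an `ESCAPE` clause). `escapeLike` and `containsPattern` are hypothetical helper names.

```typescript
// Escape LIKE metacharacters so the input is matched literally.
// Assumption: the query uses backslash as the LIKE escape character.
// The backslash must be escaped too, otherwise a trailing "\" in user input
// would escape whatever character follows it in the final pattern.
function escapeLike(input: string): string {
  return input.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Build a "contains" pattern from raw user input for use as a bound parameter.
function containsPattern(input: string): string {
  return `%${escapeLike(input)}%`;
}
```

The resulting pattern is meant to be passed as a bound parameter, never concatenated into the SQL text.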
Benchmark coverage and a deterministic Next App Router icon audit fixture are registered.
Benchmark coverage metadata remains valid for the material skill behavior update.
Benchmark execution is skipped if verify fails.
Benchmark report and persisted Claude/Codex run evidence are inspected.
Benchmark reports include quality score summaries when a setup defines a quality evaluator.
Board permissions and `board_actions` logging come from the canonical Poketo app layer rather than the standalone script
Bold Markdown next-route labels accepted by the default shipping contract pass route detection.
`bootstrap-session.mjs` and `kanban.mjs` share a single env path list
`bootstrap-session.mjs` has unit tests (10 tests, temp file fixtures)
Both skills require a journey/user-story placement decision and document when research or journey artifacts need updates.
Both skills require user-confirmed prioritization before roadmap or todo mutation.
Brainstorm output prompts use `$feature-interview` / `/feature-interview`.
Broad verified workflow gaps require one runner-native final `targeted-skill-builder` command with a concrete gap phrase.
Catalog search, filtering, result counts, asymmetry labels, and expandable rows work against generated skill data.
Check existing lessons and classify the failure as skill contract gap, benchmark harness gap, or runner noncompliance.
Check existing lessons and classify the failure as skill contract gap, benchmark harness gap, runner infrastructure issue, or runner noncompliance.
Check existing lessons and classify the issue as a skill contract gap, benchmark harness gap, runner infrastructure issue, or runner noncompliance.
Check existing lessons and prior `session-triage` benchmark triage history.
Check whether `icon-handler` is already represented in the frontend generated data.
Classify evidence into missed issues that tests should have caught, false positives/noisy detections, legitimate test detections, and infrastructure/tooling blockers.
Classify observed Claude/Codex failures by root-cause type with concrete examples.
Classify the failure as a skill contract gap, benchmark harness gap, runner infrastructure issue, or agent noncompliance.
Classify the issue as a skill contract gap, benchmark harness gap, runner infrastructure issue, or agent noncompliance.
Classify the smallest contract change that makes report-first approval behavior consistent.
Classify whether the failure is a skill contract gap, benchmark harness gap, or runner noncompliance.
Claude and Codex benchmark sessions persist under separate raw run directories.
Claude and Codex expected routes use their native conventions: `/targeted-skill-builder` and `$targeted-skill-builder`.
Claude and Codex kanban skills use the same app-layer write path for normal board operations
Cleanup and infrastructure-block handling are documented for the disposable repository workflow.
Codex `agents/openai.yaml` manifest exists.
Codex and Claude `ship` contracts forbid routine self-routing after ship completion.
Codex has `global/codex/codebase-status/agents/openai.yaml`.
Codex has `global/codex/feature-interview/agents/openai.yaml`.
Codex has `global/codex/install-agentic-skills/agents/openai.yaml`.
Codex has `global/codex/targeted-skill-builder/agents/openai.yaml`.
Codex kanban skills no longer reference `~/.claude/skills/...` paths
Codex skills with explicit grouped-question or Claude-only AskUserQuestion language are updated to one-primary-question-per-turn guidance.
Commit and push intended changes.
Compare `analyze-sessions` Claude/Codex contracts against the benchmark setup, hard assertions, and quality criteria.
Compare mirrored `benchmark-test-skill` contracts and the tier1 benchmark setup expectations.
Compare mirrored `content-programming` contracts against benchmark setup expectations.
Compare mirrored `content-programming` contracts against the benchmark setup and quality rubric.
Compare mirrored `icon-handler` contracts against benchmark setup expectations.
Compare mirrored `icon-handler` contracts against benchmark setup expectations.
Compare mirrored `icon-handler` contracts against the benchmark setup expectations.
Compare mirrored `session-triage` contracts against benchmark setup expectations.
Compare mirrored `session-triage` contracts against the tier1 benchmark setup expectations.
Compare mirrored `session-triage` contracts against the tier1 benchmark setup expectations.
Compare those surfaces with the existing Playful Lab workflow consolidation artifact.
Confirm `$benchmark-test-skill` is the active workflow and `analyze-sessions` is only the benchmark target.
Confirm `$benchmark-test-skill` is the active workflow and `analyze-sessions` is only the benchmark target.
Confirm `$benchmark-test-skill` is the active workflow and `analyze-sessions` is only the benchmark target.
Confirm `$benchmark-test-skill` is the active workflow and `analyze-sessions` is only the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `content-programming` is only the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `content-programming` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `content-programming` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `content-programming` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `icon-handler` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `icon-handler` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `icon-handler` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `icon-handler` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `session-triage` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `session-triage` is the target skill argument.
Confirm `$benchmark-test-skill` is the active workflow and `ship` is only the benchmark target.
Confirm `$benchmark-test-skill` remains the active workflow and `analyze-sessions` is only the benchmark target.
Confirm `benchmark-test-skill` is a known benchmark harness target and record its coverage status.
Confirm `benchmark-test-skill` is a known benchmark harness target and record its coverage status.
Confirm `benchmark-test-skill` is a known benchmark harness target and record its coverage status.
Confirm `benchmark-test-skill` is a known benchmark harness target and record its coverage status.
Confirm `session-triage` is a known benchmark harness target and record its coverage status. `coverage=custom`, `setup=tests/layer4/setups/tier1-workflows.setup.ts`.
Confirm `session-triage` is a known benchmark harness target and record its coverage status. ✓ `coverage=custom`, `setup_path=tests/layer4/setups/tier1-workflows.setup.ts`, `priority_tier=1`, `fixture_type=incident-report-fixture` (bench-coverage.ts:427-432).
Confirm `session-triage` is a known benchmark harness target and record its coverage status. ✓ `coverage=custom`, `setup=tests/layer4/setups/tier1-workflows.setup.ts`.
Confirm `session-triage` is a known benchmark harness target and record its coverage status. ✓ `coverage=custom`, `setup=tests/layer4/setups/tier1-workflows.setup.ts`.
Confirm `session-triage` is a known benchmark harness target and record its coverage status. ✓ `coverage=custom`, `setup=tests/layer4/setups/tier1-workflows.setup.ts`.
Confirm existing benchmark setup ownership instead of creating a new skill.
Confirm existing overlap: `$hygiene` covers structural audits but not icon conversion/metadata correction.
Confirm existing-skill overlap and choose a benchmark harness update rather than mirrored `analyze-sessions` contract changes.
Confirm existing-skill overlap and choose an `analyze-sessions` update rather than a new skill.
Confirm existing-skill overlap: update benchmark harness classification, not mirrored `icon-handler` skill contracts.
Confirm existing-skill overlap: update benchmark setup/rubric, not mirrored `icon-handler` skill contracts.
Confirm no existing skill already owns this narrower behavior.
Confirm the fix belongs in benchmark harness coverage, not mirrored `content-programming` skill contracts or a new skill.
Confirm the fix belongs in benchmark setup/rubric coverage, not the mirrored `icon-handler` skill contracts.
Confirm the fix belongs in the benchmark fixture and layer1 setup tests, not the `session-triage` skill contract.
Confirm the fix belongs in the benchmark fixture/rubric and layer1 setup tests, not the `session-triage` skill contract.
Confirm the fix belongs in the benchmark fixture/rubric and layer1 setup tests, not the mirrored `session-triage` skill contracts.
Confirm the fix belongs in the existing benchmark fixture/rubric and layer1 setup tests, not the `session-triage` skill contract.
Confirm the fix belongs in the existing benchmark-test-skill fixture and layer1 setup tests, not a new skill.
Confirm the fix belongs in the existing benchmark-test-skill fixture/setup, not a new skill.
Confirm the fix belongs in the existing pack benchmark setup, not mirrored `content-programming` skill contracts or a new skill.
Confirm the gap belongs in benchmark harness persistence, not the `benchmark-agent-review` skill contract.
Confirm the narrow gap from the current `$analyze-sessions` findings and relevant lessons.
Conventional public install/touch icon surfaces are present for static and app-install references.
Copy the new source image to Next App Router icon surfaces and conventional public install/touch icon paths.
Create mirrored `global/claude/icon-handler` and `global/codex/icon-handler` skill contracts.
Create mirrored Claude/Codex `customer-lifecycle` pack skills for the full lifecycle chain.
`create-list` command has dedicated test coverage
Creator-media docs/reference lists include `youtube-description-optimizer` in the packaging flow.
Creator-media docs/reference lists include `youtube-video-audit` and preserve the existing channel audit flow.
Creator-media docs/reference lists include the three new skills in discovery and default flow.
Creator-media handoffs point to the `remotion` pack when Remotion implementation is the next step.
Creator-media next-skill routing includes `youtube-description-optimizer` between title/thumbnail audit and portfolio.
Cross-check local Claude and Codex session histories for benchmark-test-skill setup parity context.
Current benchmark report and persisted Claude/Codex run evidence are inspected.
Deliverables include a durable leave-behind with evidence, gotchas, journey placement, documentation changes, priority decision, and exact next command.
Dependency graph script detects broken cross-references between skills
Desktop, tablet, and mobile layouts avoid overlap and meet the UI spec's accessibility states.
Deterministic layer1 contract tests lint the mirrored skill text for these requirements.
Deterministic layer1 coverage catches future loss of the self-route guard.
Deterministic layer1 coverage protects the creator-media contract language.
Deterministic quality scoring distinguishes full strategy output from calendar-only output.
Discovery docs and `skills` grouping include `session-triage`.
Discovery docs and generated Skills Showcase assets are refreshed.
Discovery docs include `codebase-status`.
Documentation is current; no missing or stale research, spec, roadmap, or task artifacts found.
Duplicate signup behavior is idempotent.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric separately from deterministic benchmark metrics.
Each evaluated output is graded against the agent-review rubric without merging subjective scores into deterministic benchmark metrics.
Edge case tests added: Unicode card names, LIKE metacharacter queries (%, _, backslash as todo), moving a card to the same list, archiving an already-archived card
Every curated workflow has selectable text, steps, artifacts, and a non-video browser-native animation or static reduced-motion fallback.
Existing creator-media routing orders include the three new external-video research lanes.
Existing creator-media skills are updated rather than adding a duplicate meta-skill.
Existing explicit write/update modes remain available after approval.
Existing lessons are checked for relevant routing/rubric patterns.
Existing lessons are checked for runner-route and benchmark workflow patterns.
Existing overlap is checked; `$hygiene` covers structural audits but not icon conversion/metadata correction.
Existing tests still pass; new tests added (83 total)
Existing-skill overlap confirms the benchmark pack skill owns this behavior; no duplicate broad lint skill is added.
Existing-skill overlap confirms the fix belongs in `ship`, not a new skill.
Existing-skill overlap confirms the fix belongs in the benchmark fixture/setup, not a new skill.
Existing-skill overlap confirms the fix belongs in the benchmark harness/setup, not a new skill.
Existing-skill overlap confirms the fix belongs in the benchmark harness/setup, not a new skill.
Existing-skill overlap confirms the fix belongs in the benchmark harness/setup, not a new skill.
Extend `tests/harness/bench-runner.ts` infrastructure-block classification for non-zero `Could not process image` API errors.
Extend layer1 coverage for the post-write existence check while preserving the no-skill-change branch.
Extend the `no-over-remediation-route` quality criterion to penalize evidence-gate or contract-change recommendations when the output also says the existing rule is adequate.
Extract evaluated run outputs and exclude infrastructure-blocked runs.
Extract red/green, benchmark, verify, test failure, false-positive, and missed-issue signals from full available history.
Extract retained generated report artifacts and benchmark context from each evaluated run.
Extract retained generated triage artifacts and benchmark context from each evaluated run.
Extract retained generated triage artifacts and benchmark context from each evaluated run.
Extract retained generated triage artifacts and benchmark context from each evaluated run. ✓ Claude retained stdout summaries; Codex retained full report text in transcripts.
Final report includes highest-impact automations and a concrete next route.
Final validation covers generated data freshness, responsive UI, accessibility/reduced-motion behavior, links, and static-route reloads.
Fix generated benchmark data so current output-quality rows and subjective review summaries are included.
Focused frontend and data validation passes.
Focused layer1 benchmark setup/quality tests pass.
Focused layer1 coverage proves generated artifact content is persisted for later review.
Focused layer1 setup/quality tests, benchmark coverage, target verify, install/skill contract checks, and whitespace validation pass.
Focused layer1 setup/quality tests, required skill checks, benchmark coverage, target verify, Codex smoke benchmark, and whitespace validation pass.
Focused layer1 tests, benchmark coverage, required skill validation, and whitespace validation pass.
Focused layer1 tests, benchmark coverage, verify, and one-run Codex benchmark smoke pass.
Focused layer1 tests, benchmark coverage, verify, smoke benchmarks, and whitespace validation pass.
Focused layer1 tests, install, skill dependency/version/routing checks, targeted retained-artifact checks, and whitespace validation pass; benchmark coverage blocker is recorded.
Focused layer1, benchmark coverage, target verify, smoke benchmark, and whitespace validation pass.
Focused layer1, required skill checks, benchmark coverage, target verify, Codex smoke benchmark, and whitespace validation pass.
Focused smoke tests cover live-lock waiting and stale-lock recovery.
Focused tests/build checks pass, and results are recorded in `tasks/todo.md`.
Focused validation passes without adding a database, video, Remotion, runtime API, GitHub Actions, or unnecessary root dependencies.
Focused validation passes without adding a database, video, Remotion, runtime API, GitHub Actions, or unnecessary root dependencies.
Focused validation passes.
Follow-up routing recommends the correct creator-media strategy skill from dossier findings.
Follow-up routing recommends the correct creator-media strategy skill from dossier findings.
Follow/about route converts proof interest into G, LexCorp, YouTube, X/Twitter, Discord, GitHub, and newsletter actions.
Fresh benchmark report and persisted Claude/Codex run artifacts are inspected.
Fresh benchmark report and persisted Claude/Codex run evidence are inspected.
Fresh benchmark report and persisted Claude/Codex run evidence are inspected.
Fresh benchmark report and persisted Claude/Codex run evidence are inspected.
Fresh benchmark report and persisted Claude/Codex run evidence are inspected.
Fresh benchmark report and persisted failed-run evidence are inspected.
Fresh benchmark report and persisted failed-run evidence are inspected.
Full available Claude/Codex history is parsed for `mobile-ideas` activity.
Full available Claude/Codex prompt history and rich Codex sessions are scanned for interview-question cadence evidence.
Full available Claude/Codex prompt history and rich Codex sessions are scanned for repository-scoped red/green, benchmark, verify, false-positive, and missed-test signals.
Full kanban lifecycle works: brainstorm creates Backlog cards → spec-interview specs them → roadmap moves to Todo → run moves to In Progress (with conflict warnings) → ship/ship-end moves to Done or Punt
Future skill creation/update workflows require benchmark coverage handling.
Future skill creation/update workflows require benchmark coverage handling.
Future skill creation/update workflows require benchmark quality-rubric handling where practical.
Future skill creation/update workflows require benchmark quality-rubric handling where practical.
Generate a conventional `app/favicon.ico` from the new source asset.
Generated assets are validated, then committed and pushed on `master`.
Generated showcase data is refreshed, validation passes, and results are recorded in `tasks/todo.md`.
Generated Skills Showcase data is refreshed and validated because curated benchmark evidence changed.
Generated Skills Showcase data is refreshed and validated because curated review evidence changed.
Generated Skills Showcase data is refreshed and validated because curated review evidence changes.
Generated Skills Showcase data is refreshed and validated because curated review evidence changes.
Generated Skills Showcase data is refreshed and validated because curated review evidence changes.
Generated Skills Showcase data is refreshed and validated if curated benchmark evidence changes.
Generated Skills Showcase data is refreshed and validated if curated benchmark evidence changes.
Generated Skills Showcase data is refreshed and validated if curated benchmark evidence changes.
Generated Skills Showcase data is refreshed and validated if curated benchmark evidence changes.
Generic fallback remains available until all skills have custom coverage.
Generic fallback remains available until all skills have custom coverage.
GitHub/open-source proof telemetry is visible and does not claim live LexCorp product metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric separately from deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric without merging scores into deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric without merging scores into deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric without merging scores into deterministic benchmark metrics.
Grade each evaluated output against the agent-review rubric without merging scores into deterministic benchmark metrics. ✓ 6 evaluated outputs scored separately from hard pass rates.
Grade generated artifacts against the agent-review rubric.
Hard assertions fail suffixed final commands such as `$targeted-skill-builder run post-doc-edit validation and lessons capture gate for Codex`.
Hard assertions reject calendar-only output for the full-contract fixture.
Hard assertions require structured report headings in addition to exact fixture evidence.
Hard-move `journey-map` out of `business-discovery`, including stale agent metadata.
Identify existing-skill overlap: update existing custom benchmark setup, not the `session-triage` skill contract.
Identify existing-skill overlap: update the existing `icon-handler` benchmark setup, not the mirrored `icon-handler` skill contracts.
Identify existing-skill overlap: update the existing `session-triage` custom benchmark setup, not the mirrored `session-triage` skill contracts.
Identify research skills that currently require direct file writes.
Identify why the site payload still points at the older 2026-05-13 benchmark report.
If verify passes, run `pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill analyze-sessions --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill benchmark-test-skill --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill benchmark-test-skill --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill benchmark-test-skill --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill benchmark-test-skill --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill content-programming --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill content-programming --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill content-programming --agent both --runs 3 --chunk-size 3 --pause 0`. Claude 0/3, Codex 3/3, no blocked runs.
If verify passes, run `pnpm bench --skill content-programming --agent both --runs 3 --chunk-size 3 --pause 0`. Claude 3/3 and Codex 3/3, no blocked runs.
If verify passes, run `pnpm bench --skill icon-handler --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 1/3 (33.3%), Codex 3/3 (100.0%), no blocked runs.
If verify passes, run `pnpm bench --skill icon-handler --agent both --runs 3 --chunk-size 3 --pause 0`. Claude 2/3, Codex 2/3, no blocked runs.
If verify passes, run `pnpm bench --skill icon-handler --agent both --runs 3 --chunk-size 3 --pause 0`. Claude 2/3, Codex 3/3, no blocked runs.
If verify passes, run `pnpm bench --skill icon-handler --agent both --runs 3 --chunk-size 3 --pause 0`. Claude 3/3, Codex 3/3, no blocked runs.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 0/2 (0.0%), 1 blocked; Codex 3/3 (100.0%), 0 blocked.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 3/3 (100.0%), Codex 2/3 (66.7%), no blocked runs.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 3/3 (100.0%), Codex 2/3 (66.7%), no blocked runs.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 3/3 (100.0%), Codex 3/3 (100.0%), no blocked runs.
If verify passes, run `pnpm bench --skill session-triage --agent both --runs 3 --chunk-size 3 --pause 0`. ✓ Claude 3/3 (100.0%), Codex 3/3 (100.0%), no blocked runs.
Increase `no-over-remediation-route` weight so unconditional skill/contract edit routes fall below the quality threshold.
Inspect `benchmark/test-content-programming-2026-05-14.md` and persisted Claude/Codex run evidence.
Inspect `benchmark/test-icon-handler-2026-05-13.md` and persisted failed Claude run evidence.
Inspect `benchmark/test-icon-handler-2026-05-14.md` and persisted failed run evidence for `icon-handler-claude-86ed23d1`.
Inspect `benchmark/test-icon-handler-2026-05-14.md` and persisted failed run evidence.
Inspect `report.md`, `report.json`, run JSON files, and retained `icon-audit.md` artifacts.
Inspect existing benchmark evidence generation and frontend rendering.
Inspect fresh benchmark report and persisted failed run evidence.
Inspect retained generated `pack-benchmark-output.md` artifacts, fixture facts, benchmark report, and benchmark setup context.
Inspect retained generated `pack-benchmark-output.md` artifacts, fixture facts, benchmark report, and mirrored skill contract context.
Inspect retained generated `session-analysis.md` artifacts and benchmark context for each evaluated run.
Inspect retained generated `session-analysis.md` artifacts and benchmark context for each evaluated run.
Inspect retained generated `session-analysis.md` artifacts and benchmark context for each evaluated run.
Inspect the 2026-05-13 benchmark report and persisted Claude/Codex run evidence.
Inspect the benchmark report and persisted Claude/Codex run evidence.
Inspect the benchmark report, persisted Claude/Codex run data, and retained output artifacts.
Inspect the current benchmark report and persisted Claude/Codex run evidence.
Inspect the fresh benchmark report and persisted Claude/Codex run evidence.
Inspect/proof UI links to public GitHub receipts and validation artifacts.
Install, skill integrity, coverage, showcase, target verify, targeted `rg`, and whitespace validation pass.
`install.sh` has a Vitest test suite covering the happy path and error cases (8 tests)
Invalid `--board` ID exits with error (not silent empty results)
Invalid emails and database failures produce appropriate public UI states without leaking internals.
Inventory available Claude/Codex prompt history and rich Codex sessions for this repository.
Inventory recent persisted benchmark reports and raw both-agent run artifacts.
Keep the fix scoped to the workflows route and Playful Lab/TUI styling unless evidence shows shared CSS is the root cause.
Keep the pilot pattern usable as the reference for later catalog, packs, benchmarks, proof, and follow refactors.
Lane-spec artifact follows JSON + Markdown mirror pattern with lifecycle tracking.
Latest Claude and Codex run directories are resolved from `benchmark/test-analyze-sessions-2026-05-15.md`.
Latest Claude and Codex run directories are resolved from `benchmark/test-analyze-sessions-2026-05-15.md`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/analyze-sessions-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/benchmark-test-skill-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/content-programming-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/icon-handler-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/session-triage-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/session-triage-*`.
Latest Claude and Codex run directories are resolved from `tests/benchmarks/runs/session-triage-*`.
Layer1 coverage guards against reintroducing the misleading `run-codex-abc` fixture.
Layer1 covers the root artifact requirement, required report sections, no-skill-change branch, and explicit "task checklist, not skill contract" language.
Layer1 proves assertions are skipped for this image-processing API error.
Layer1 regression coverage accepts `Recommended next skill: none` and rejects reintroducing the hard-coded route assertion.
Layer1 regression coverage accepts no-skill-change routing and rejects over-remediation routing.
Layer1 regression coverage asserts the post-write existence check and preserves the no-skill-change branch.
Layer1 regression coverage protects the contract and benchmark rubric behavior.
Layer1 regression coverage protects the prompt, hard assertion, and quality behavior.
Layer1 regression coverage protects the route parser and `analyze-sessions` fixture behavior.
Layer1 regression coverage protects title-case agent rows in benchmark reports.
Layer1 regression coverage proves concrete fixture references pass and generic evidence prose still fails.
Layer1 regression coverage rejects `$run` evidence-gate over-remediation while preserving accepted operational `$run` routing.
Layer1 regression coverage rejects malformed metric-table reports while preserving the existing structured passing fixture.
LinkedIn evidence guidance is present in mirrored creator-media skills or a dedicated mirrored LinkedIn skill, depending on Phase 12/13 implementation shape.
LinkedIn evidence guidance is present in mirrored creator-media skills or a dedicated mirrored LinkedIn skill, depending on Phase 12/13 implementation shape.
LinkedIn records normalize into the shared evidence schema and dossier.
LinkedIn records normalize into the shared evidence schema and dossier.
Live tests can target Claude, Codex, or both through documented environment flags.
Live tests use temporary repos and do not mutate tracked repo files.
Local app validation, database-contract checks, admin access checks, and whitespace checks pass.
Locate latest persisted `session-triage` benchmark runs for Claude and Codex.
Mirrored `analyze-sessions` contracts are compared against benchmark setup and quality expectations.
Mirrored `benchmark-test-skill` contracts and the tier1 benchmark setup expectations are compared.
Mirrored `content-programming` contracts are compared against benchmark setup and quality expectations.
Mirrored `content-programming` contracts are compared against benchmark setup expectations.
Mirrored `global/claude/icon-handler` and `global/codex/icon-handler` skill contracts exist.
Mirrored `icon-handler` contracts are compared against the Tier 2/3 benchmark setup expectations.
Mirrored `icon-handler` contracts are compared against the Tier 2/3 benchmark setup expectations.
Mirrored `session-triage` contracts are compared against benchmark setup expectations.
Mirrored `session-triage` contracts are compared against the tier1 benchmark setup expectations.
Mirrored `session-triage` contracts are compared against the tier1 benchmark setup expectations.
Mirrored benchmark-test-skill contracts are compared for drift.
Mirrored benchmark-test-skill contracts explain the distinction between generic smoke evidence and deep domain-quality evidence.
Mirrored benchmark-test-skill contracts require the eligibility preflight before verify.
Mirrored Claude/Codex `benchmark-test-skill` contracts are compared.
Mirrored Claude/Codex `benchmark-test-skill` contracts explicitly require command resolution, eligibility preflight, report verification, infrastructure-blocked classification, and final next-step routing.
Mirrored Claude/Codex `youtube-competitive-research` skill contracts exist.
Mirrored Claude/Codex `youtube-format-research` skill contracts exist.
Mirrored Claude/Codex `youtube-vid-research` skill contracts exist.
Mirrored Claude/Codex skill contracts exist for all v1 skills.
Mirrored Claude/Codex skill presence and relevant lessons are checked.
Mobile widths avoid horizontal page overflow from chips, commands, benchmark/demo pre blocks, controls, or notebook panels.
Monorepo lane-spec guidance carries the same branch/PR requirements.
Monorepo lane-spec guidance carries the same branch/PR requirements.
Mutation-capable skills patched by this phase explicitly require next-step routing in their final response.
Neon stores subscriber records with `email`, `status`, `source_page`, `consent_text_version`, `created_at`, and `updated_at`.
Newsletter/email capture works with a configured provider endpoint or clearly degrades to a non-collecting fallback.
Next App Router icon surfaces use the new source asset.
Next-skill routing gives priority to immediate user intent such as strategy refresh, recording prep, upload prep, performance review, or owner-analytics/manual blocker work.
No `analyze-session` skill or alias is created.
No broken skill cross-references in the repo
No credentials in tracked files; Neon password rotated
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No kanban skill requires `POKETOWORK_DATABASE_URL` for standard usage
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
No regressions in previous phase tests.
Non-trivial source changes require targeted `quality-sweep audit`, `expert-review`, or an explicitly justified equivalent adversarial review before commit/push.
Non-trivial source changes require targeted `quality-sweep audit`, `expert-review`, or an explicitly justified equivalent adversarial review before commit/push.
Non-zero runner output containing `Could not process image` is marked infrastructure-blocked with a clear reason.
Output requires overview, history signal, recent work, current status, outstanding work, risks/drift, and a concrete next command.
Pack docs describe stale-lock behavior.
Pack docs route non-YouTube creator-media work through the foundation before platform-specific skills.
Pack docs route non-YouTube creator-media work through the foundation before platform-specific skills.
Pack docs, skill references, pack normalization, and relevant tests know about `remotion`.
Pack map distinguishes global core, packs, overlays, and compatibility aliases with usable mobile behavior.
Pack skills are covered by custom Codex benchmark setups or explicit blocked statuses.
Pack skills are covered by custom Codex benchmark setups or explicit blocked statuses.
Pack skills have quality rubrics where deterministic signals are practical, or explicit blocked/deferred quality notes.
Pack skills have quality rubrics where deterministic signals are practical, or explicit blocked/deferred quality notes.
Pack structure registered in README, `docs/skills-reference.md`, and `docs/packs.md`.
Pack workflow prompts require literal accepted next-route labels.
Patch the showcase data generator to parse current benchmark summary rows robustly.
Plan-mode `request_user_input` guidance remains allowed only for one material decision with 2-3 concrete options, not for batching unrelated questions.
Quality evaluator tests prove that strong fixture outputs pass and degraded/generic/hallucinated outputs fail.
Quality evaluator tests prove that strong fixture outputs pass and degraded/generic/hallucinated outputs fail.
Rate limit and quota outputs are reported as infrastructure-blocked runs outside the evaluated skill pass rate.
Read `$analyze-sessions` guidance and current task/lesson docs.
Read relevant lessons and current `session-triage` benchmark setup.
Read relevant lessons and inspect the workflows route/component tree.
Read relevant lessons and latest triage report.
Read relevant lessons and the `content-programming` benchmark agent-review remediation.
Read relevant lessons, fresh triage report, current benchmark fixture, and layer1 setup coverage.
Read relevant lessons, triage evidence, mirrored `content-programming` contracts, and pack benchmark setup overlap.
Read relevant lessons, triage report, benchmark fixture, and layer1 route coverage.
Read relevant lessons, triage report, current benchmark fixture, and layer1 setup coverage.
Read the targeted-skill-builder contract and current lessons.
Real examples from history are cited with counts and clear limitations.
Recommendations distinguish skills, agents, plugins/integrations, and standing instructions.
Record coverage status and setup path from the harness list output.
Record coverage status and setup path from the harness list output. `coverage=custom`, setup `tests/layer4/setups/tier23-global-workflows.setup.ts`.
Record coverage status: `custom`, setup `tests/layer4/setups/tier1-workflows.setup.ts`.
Record coverage status: `custom`, setup `tests/layer4/setups/tier1-workflows.setup.ts`.
Record extracted token decisions and accessibility findings in `design-system-interview.md`.
Record investigation results and ship intended changes on `master`.
Record investigation results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended benchmark/task changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes on `master`.
Record results here, then commit and push intended changes.
Record results here, then commit and push intended changes.
Record results here, then commit and push intended fixture/task/generated changes on `master`.
Record results here, then commit and push intended harness/task changes on `master`.
Record results here, then commit and push intended review/task changes on `master`.
Record results here, then commit and push intended review/task changes on `master`.
Record results here, then commit and push intended review/task changes on `master`.
Record results here, then commit and push intended review/task changes on `master`.
Record results here, then commit and push intended triage/task changes on `master`.
Record results here, then commit and push intended triage/task changes on `master`.
Record results here, then commit and push intended triage/task changes on `master`.
Record results in `tasks/todo.md`, then commit and push on `master`.
Record validation results, then commit and push intended changes on `master`.
Redaction and privacy handling are documented before analysis.
Redaction and privacy handling are documented before analysis.
Refresh generated showcase data because tracked `SKILL.md`/`PACK.md` behavior changes.
Refresh generated Skills Showcase data and validate pack install behavior.
Refresh generated Skills Showcase data because curated benchmark evidence changed.
Refresh generated Skills Showcase data because curated benchmark evidence changes.
Refresh generated Skills Showcase data because curated review evidence changed.
Refresh generated Skills Showcase data because curated review evidence changes.
Refresh generated Skills Showcase data because curated review evidence changes.
Refresh generated Skills Showcase data because curated review evidence changes.
Refresh generated Skills Showcase data because curated review evidence changes.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Refresh Skills Showcase generated data if tracked skill behavior changes.
Regenerate showcase data assets and benchmark results matrix.
Regenerate showcase data, validate, verify the UI locally, then commit and push on `master`.
Register benchmark coverage and a deterministic Next App Router icon audit fixture.
Regression coverage protects the Codex cadence language.
Remove the hard-coded `$targeted-skill-builder` route requirement from the current one-off noncompliance fixture.
Render subjective review score/report evidence in catalog benchmark panels and the benchmarks table.
Repeated prompts and multi-step workflow patterns are grouped with counts and examples.
Replace the ASCII root `calc-mascot-icon.png` fixture with a tiny valid PNG source asset while preserving stale existing icon-surface evidence.
Replace the top `/workflows` legacy selector and `blueprint-panel` walkthrough with a single Playful Lab console.
Report counts, trend evidence, parity verdict, and recommended next route.
Reports distinguish explicit source evidence from inferred source labels and avoid unsupported runner ownership.
Repository artifacts, commits, and planning outputs are inventoried.
Representative one-run Codex benchmarks produce quality-scored reports for at least `run`, `investigate`, `design-system`, and one pack skill.
Representative one-run Codex benchmarks produce quality-scored reports for at least `run`, `investigate`, `design-system`, and one pack skill.
Required skill-builder validation and one-run Claude benchmark smoke pass or are recorded with a clear infrastructure block.
Required validation passes and results are recorded in `tasks/todo.md`.
Required validation passes and results are recorded in `tasks/todo.md`.
Required validation passes and results are recorded in `tasks/todo.md`.
Required validation passes and results are recorded in `tasks/todo.md`.
Required validation passes and results are recorded in `tasks/todo.md`.
Required validation passes and results are recorded in `tasks/todo.md`.
Research skills with direct-write contracts are identified.
Resolve `$benchmark-agent-review session-triage` as the active workflow.
Resolve latest Claude and Codex `icon-handler` benchmark run directories.
Resolve latest Claude and Codex run directories from `tests/benchmarks/runs/content-programming-*`.
Resolve latest Claude and Codex run directories from `tests/benchmarks/runs/content-programming-*`.
Resolve latest Claude and Codex run directories from `tests/benchmarks/runs/session-triage-*`.
Resolve the latest Claude and Codex run directories from `benchmark/test-analyze-sessions-2026-05-15.md`.
Resolve the latest Claude and Codex run directories from `tests/benchmarks/runs/analyze-sessions-*`.
Resolve the latest Claude and Codex run directories from `tests/benchmarks/runs/benchmark-test-skill-*`.
Resolve the latest Claude and Codex run directories from `tests/benchmarks/runs/session-triage-*`.
Resolve the latest Claude and Codex run directories from `tests/benchmarks/runs/session-triage-*`. ✓ `session-triage-claude-e5f0772b` and `session-triage-codex-374ad6f0`.
Resolve the latest Claude and Codex run directories from the fresh benchmark report.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Results are recorded in `tasks/todo.md`; if tracked files change, commit and push on `master`.
Results are recorded in `tasks/todo.md`.
Retained generated `icon-audit.md` artifacts and benchmark context are inspected for each evaluated run.
Retained generated `pack-benchmark-output.md` artifacts and benchmark context are inspected for each evaluated run.
Retained generated `session-analysis.md` artifacts and benchmark context are inspected for each evaluated run.
Retained generated `session-analysis.md` artifacts and benchmark context are inspected for each evaluated run.
Retained generated `session-analysis.md` artifacts and benchmark context are inspected for each evaluated run.
Retained generated report artifacts and benchmark context are inspected for each evaluated run.
Retained generated triage artifacts and benchmark context are extracted for each evaluated run.
Retained generated triage artifacts and benchmark context are inspected for each evaluated run.
Retained generated triage artifacts and benchmark context are inspected for each evaluated run.
Roadmap unspecced-idea handling prefers feature-interview triage and keeps spec-interview for confirmed full-spec work.
Route assertions accept `Next command`, `Recommended next command`, `Recommended next skill`, and `Next work` plus `Recommended next command`.
Run `pnpm bench --list-skills` and confirm `analyze-sessions` is known, including coverage status.
Run `pnpm bench --list-skills` and confirm `content-programming` is known to the harness, including coverage status. `coverage=custom`, setup `tests/layer4/setups/packs/pack-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `content-programming` is known, including coverage status.
Run `pnpm bench --list-skills` and confirm `content-programming` is known, including coverage status. `coverage=custom`, setup `tests/layer4/setups/packs/pack-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `content-programming` is known, including coverage status. `coverage=custom`, setup `tests/layer4/setups/packs/pack-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `icon-handler` is known to the harness, including coverage status. `coverage=custom`, setup `tests/layer4/setups/tier23-global-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `icon-handler` is known to the harness, including coverage status. `coverage=custom`, setup `tests/layer4/setups/tier23-global-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `icon-handler` is known to the harness.
Run `pnpm bench --list-skills` and confirm `icon-handler` is known to the harness. `coverage=custom`, setup `tests/layer4/setups/tier23-global-workflows.setup.ts`.
Run `pnpm bench --list-skills` and confirm `session-triage` is known to the harness.
Run `pnpm bench --list-skills` and confirm `session-triage` is known to the harness.
Run `pnpm bench --list-skills` and record `analyze-sessions` coverage status.
Run `pnpm bench --list-skills` and record `analyze-sessions` coverage status.
Run `pnpm bench --list-skills` and record `analyze-sessions` coverage status.
Run `pnpm bench --list-skills` and record `analyze-sessions` coverage status.
Run `pnpm bench --list-skills` and record `ship` coverage status. `coverage=custom`, setup `tests/layer4/setups/tier1-workflows.setup.ts`.
Run `pnpm verify --skill analyze-sessions`; stop before bench if verification fails.
Run `pnpm verify --skill analyze-sessions`; stop before bench if verification fails.
Run `pnpm verify --skill analyze-sessions`; stop before bench if verification fails.
Run `pnpm verify --skill analyze-sessions`; stop before bench if verification fails.
Run `pnpm verify --skill analyze-sessions`; stop before bench if verification fails.
Run `pnpm verify --skill benchmark-test-skill` from `tests/` and stop if it fails.
Run `pnpm verify --skill benchmark-test-skill` from `tests/` and stop if it fails.
Run `pnpm verify --skill benchmark-test-skill` from `tests/` and stop if it fails.
Run `pnpm verify --skill benchmark-test-skill` from `tests/` and stop if it fails.
Run `pnpm verify --skill content-programming`; stop before bench if verification fails.
Run `pnpm verify --skill content-programming`; stop before bench if verification fails.
Run `pnpm verify --skill content-programming`; stop before bench if verification fails. Layer1 PASS in 3.9s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill content-programming`; stop before bench if verification fails. Layer1 PASS in 4.5s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill icon-handler`; stop before bench if verification fails. ✓ layer1 PASS in 8.8s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill icon-handler`; stop before bench if verification fails. Layer1 PASS in 10.8s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill icon-handler`; stop before bench if verification fails. Layer1 PASS in 12.3s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill icon-handler`; stop before bench if verification fails. Layer1 PASS in 8.9s; layer2 SKIP because no target-specific layer2 tests matched.
Run `pnpm verify --skill session-triage` from `tests/` and stop if it fails.
Run `pnpm verify --skill session-triage` from `tests/` and stop if it fails. ✓ layer1 PASS (1,350 tests, 8.4s), layer2 SKIP (no target-specific tests).
Run `pnpm verify --skill session-triage` from `tests/` and stop if it fails. ✓ layer1 PASS (1,350 tests, 8.6s), layer2 SKIP (no target-specific tests).
Run `pnpm verify --skill session-triage` from `tests/` and stop if it fails. ✓ layer1 PASS (1,350 tests, 8.8s), layer2 SKIP (no target-specific tests).
Run `pnpm verify --skill session-triage` from `tests/` and stop if it fails. ✓ layer1 PASS (1,349 tests, 8.8s), layer2 SKIP (no target-specific tests).
Run `pnpm verify --skill session-triage`.
Run `pnpm verify --skill session-triage`. ✓ layer1 PASS in 8.9s; layer2 SKIP because no target-specific layer2 tests matched.
Run a one-run benchmark smoke if focused validation passes.
Run focused layer1 benchmark setup/quality tests.
Run focused layer1 coverage and `analyze-sessions` verify.
Run focused layer1 setup/quality tests, benchmark coverage, target verify, install/skill contract checks, and whitespace validation.
Run focused layer1 setup/quality tests, benchmark coverage, target verify, install/skill contract checks, showcase data refresh/validation, targeted `rg`, and whitespace validation.
Run focused layer1 setup/quality tests, required skill checks, benchmark coverage, target verify, Codex smoke benchmark, and whitespace validation.
Run focused layer1 tests, install, skill dependency/version/routing checks, targeted retained-artifact checks, and whitespace validation; record benchmark coverage blocker.
Run focused layer1, benchmark coverage, target verify, smoke benchmark, and whitespace validation.
Run focused tests/type/build checks and browser viewport verification.
Run focused validation, benchmark coverage, verify, one-run smoke, and whitespace checks.
Run focused workflow/smoke tests, typecheck, production build, and whitespace validation.
Run install, skill integrity, coverage, showcase, and whitespace validation.
Run required skill-builder validation and one-run Claude benchmark smoke.
Run required validation, record results here, then commit and push intended changes on `master`.
Run required validation, record results, then commit and push intended changes on `master`.
Run required validation, record results, then commit and push intended changes on `master`.
Run required validation, record results, then commit and push intended changes on `master`.
Run required validation, record results, then commit and push intended changes on `master`.
Run required validation, refresh generated Skills Showcase data, and run a one-run both-agent smoke if practical.
Run skill contract and targeted text validation.
Run standard skill validation, targeted heading checks, and whitespace validation.
Run targeted and required skill validation, then record results here.
Run targeted and required validation, record results, then commit and push intended changes on `master`.
Run targeted and required validation, then record results here.
Run targeted validation, record results, then commit and push intended changes on `master`.
Run validation and record results.
Scan full available Claude/Codex history for interview-question cadence evidence.
Scan the current static site CSS, UI spec, and product spec for concrete design tokens.
Script-based validation passes for contracts, lane-spec schema, detection, and boundary checks.
Search with special characters has regression tests
Sequential/direct work still defaults to committing and pushing on `main` or `master`.
Sequential/direct work still defaults to committing and pushing on `main` or `master`.
Session-analysis live tests assert structured outputs for `analyze-sessions` and `session-triage`.
Shared headless operations cover the current kanban workflow needs: board discovery/details/activity, create board/list/card, update card, move card, search, archive/restore.
Shared styles and scripts provide the responsive Swiss grid/blueprint foundation without one-off page styling.
Shared styles and scripts provide the responsive Swiss grid/blueprint foundation without one-off page styling.
Skill discovery docs include feature-interview.
Skill discovery docs include targeted-skill-builder.
Skill workflow reads repo orientation, task docs, git evidence, code health signals, and full local Claude/Codex prompt history filtered to the target repo.
Skill-changing contracts prompt regeneration and curated website review when `SKILL.md` behavior or metadata changes.
Skill-changing contracts prompt regeneration and curated website review when `SKILL.md` behavior or metadata changes.
Skills defer to `turbo run` when `turbo.json` is present and fall back to `pnpm --filter` otherwise.
Skills Showcase exposes benchmark results or links to the generated matrix without confusing coverage status with completed graded runs.
Skills Showcase exposes benchmark results or links to the generated matrix without confusing coverage status with completed graded runs.
Skills Showcase generated data is refreshed and results are recorded in `tasks/todo.md`.
Skills with custom setups still use their custom fixtures and assertions.
Skills without custom setups use a generic smoke benchmark.
Spec-interview explicitly recommends roadmap after completed/updated specs when no higher-priority design gate is missing.
Standard skill validation, showcase data refresh, targeted behavior checks, and `git diff --check` pass.
Standard skill validation, showcase data refresh, targeted checks, and whitespace validation pass.
Static checks and browser visual verification pass after the change.
Static route entrypoints exist for `/`, `/workflows/`, `/packs/`, `/catalog/`, `/inspect/`, and `/follow/`.
Static route entrypoints exist for `/`, `/workflows/`, `/packs/`, `/catalog/`, `/inspect/`, and `/follow/`.
Step 39.1: Validate and promote `docs/benchmark-results-matrix.md` as a generated source of truth.
Step 39.2: Add benchmark results surface to Skills Showcase UI.
Step 39.3: Design safe disposable GitHub test-repository fixture infrastructure.
Step 39.4: Add `commit-and-push-by-feature` safe fixture plan using the disposable repo infrastructure.
Step 39.5: Add `sync` safe fixture plan using the disposable repo infrastructure.
Step 39.6: Write regression tests covering acceptance criteria.
Step 39.7: Run all tests, verify they pass, and validate the phase.
Subscriber data is never exposed in generated public assets or committed files.
Supported benchmark targets are available from the harness without reading source.
Supported target verification still works for `design-system`.
Targeted and required validation pass, including showcase data refresh/validation.
Targeted and required validation pass.
Targeted and required validation pass.
Targeted report and type validation passes, then changes are committed and pushed on `master`.
Targeted validation confirms the updated contracts.
Task pipeline is healthy; no blocking issues found. All 39 roadmap phases are complete and the documentation scan is current.
The `analyze-sessions` fixture provides broad repeated-history evidence when expecting `targeted-skill-builder`.
The `benchmark-test-skill` fixture requires source file names and report path in generated reports.
The `no-over-remediation-route` quality criterion penalizes evidence-gate or contract-change recommendations when the output also says the existing rule is adequate.
The 2026-05-13 benchmark report and persisted Claude/Codex run evidence are inspected.
The app framework, source icon dimensions, and existing icon surfaces are audited before replacement.
The audit passes along with `./scripts/skill-deps.sh --broken`, `./scripts/skill-versions.sh --missing`, and `git diff --check`.
The benchmark coverage registry reflects any newly unblocked setup status only after the safe fixture is implemented and validated.
The benchmark coverage registry reflects any newly unblocked setup status only after the safe fixture is implemented and validated.
The benchmark prompt asks for pillars, formats, cadence constraints, portfolio balance, measurement, cleanup/refactor, and next series candidates.
The benchmark prompt tells runners the literal command must not include runner-label suffixes.
The benchmark report and persisted run evidence are inspected.
The benchmark results matrix references `benchmark/review-icon-handler-2026-05-14.md`.
The capability matrix skill records platform evidence sources, fields, missing fields, metric availability, audit depth, operational risk, and recommended next skill.
The capability matrix skill records platform evidence sources, fields, missing fields, metric availability, audit depth, operational risk, and recommended next skill.
The current `benchmark-agent-review` contract is inspected.
The current generated site payload is checked for `icon-handler` benchmark evidence.
The deterministic quality route criterion fails the same suffixed final command.
The dossier contract supports LinkedIn, personal websites, GitHub, podcasts, talks, newsletters, and product docs.
The dossier contract supports LinkedIn, personal websites, GitHub, podcasts, talks, newsletters, and product docs.
The failure is classified as a skill contract gap, benchmark fixture gap, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, runner infrastructure issue, or agent noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, runner infrastructure issue, or agent noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, runner infrastructure issue, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, runner infrastructure issue, or runner noncompliance.
The failure is classified as a skill contract gap, benchmark harness gap, runner infrastructure issue, or runner noncompliance.
The fix is scoped to benchmark harness coverage and route parsing, not mirrored `analyze-sessions` skill contracts.
The fix is scoped to benchmark harness coverage, not mirrored `analyze-sessions` skill contracts.
The fix is scoped to benchmark harness coverage, not mirrored `content-programming` skill contracts.
The fix is scoped to benchmark harness coverage, not mirrored `content-programming` skill contracts.
The fix is scoped to benchmark harness coverage, not mirrored `content-programming` skill contracts.
The fix is scoped to benchmark harness persistence, not the `benchmark-agent-review` skill contract.
The fix is scoped to benchmark runner classification and layer1 regression coverage, not mirrored `icon-handler` skill contracts.
The fix is scoped to mirrored `analyze-sessions` contracts and benchmark coverage, not a new meta-skill.
The fix is scoped to the benchmark fixture and layer1 setup tests, not the `session-triage` skill contract.
The fix is scoped to the benchmark fixture and layer1 setup tests, not the mirrored `session-triage` skill contracts.
The fix is scoped to the benchmark fixture/rubric and layer1 setup tests, not the `session-triage` skill contract.
The fix is scoped to the benchmark fixture/rubric and layer1 setup tests, not the `session-triage` skill contract.
The fix is scoped to the benchmark fixture/rubric and layer1 setup tests, not the mirrored `session-triage` skill contracts.
The fix is scoped to the responsive hero layout and does not refactor unrelated site sections.
The fixture prompt requires reading `session-log.md` and `tasks/lessons.md`, writing `session-triage-report.md` in the project root before optional exploration, and preserving the no-skill-change branch for one-off noncompliance with an adequate validation rule.
The fixture prompt requires stable report sections/tables for verification, benchmark metrics, raw evidence, and next route.
The fixture prompt requires verifying `session-triage-report.md` exists in the project root after writing and creating it before response if missing.
The fixture prompt says runner route convention is authoritative regardless of fixture filenames or raw session paths.
The fixture raw session path is neutral and no longer nudges Claude toward `$ship`.
The generator selects `benchmark/test-icon-handler-2026-05-14.md` instead of the stale 2026-05-13 report.
The handoff names likely owner surface and validation expectation when recommending a skill update.
The hard-coded `$targeted-skill-builder` route requirement is removed from the current one-off noncompliance fixture.
The harness supports weighted rubric criteria, critical criteria, evaluator notes, and minimum score thresholds.
The harness supports weighted rubric criteria, critical criteria, evaluator notes, and minimum score thresholds.
The lab layout stacks predictably below tablet widths and remains usable at phone widths.
The lane explicitly forbids logged-in scraping, paid API dependency, bot-protection bypass, and private-data collection.
The lane explicitly forbids logged-in scraping, paid API dependency, bot-protection bypass, and private-data collection.
The latest `ship` review output is compared against the desired remediation handoff behavior.
The launcher runs the root `install.sh` and supports `--uninstall`.
The LinkedIn lane uses owner exports and manual/public snapshots as the baseline.
The LinkedIn lane uses owner exports and manual/public snapshots as the baseline.
The output contract includes root cause, skill-contract fixes, validation checks, and confidence/evidence gaps.
The output-quality rubric accepts semantic latency/cost evidence rather than one exact serialized key format.
The output-quality rubric fails reports that preserve facts but place benchmark metrics outside the `## Benchmark Metrics` table.
The output-quality rubric rewards report ergonomics and rejects unstructured evidence dumps.
The recommendation explicitly chooses keep, reform, or try something else and explains why.
The report contract covers release timing, performance snapshot, packaging, hook/content structure, transcript evidence, comments/audience response, and prioritized fixes.
The report separates missed issues, false positives, legitimate detections, and infrastructure/tooling blockers instead of treating all failures as equal.
The reported text/diagram collision is validated against current code and visual behavior.
The repository has a repeatable audit command for missing next-step routing in mutation-capable skill contracts.
The schema skill defines normalized evidence records, metric confidence, evidence confidence, capture method, auth context, raw evidence paths, and privacy notes.
The schema skill defines normalized evidence records, metric confidence, evidence confidence, capture method, auth context, raw evidence paths, and privacy notes.
The ship manifest requires changed files, per-file purpose, user-goal mapping, tests run, skipped tests, residual risk, and next command.
The ship manifest requires changed files, per-file purpose, user-goal mapping, tests run, skipped tests, residual risk, and next command.
The skill distinguishes public evidence from owner-provided/private analytics and records evidence gaps.
The skill distinguishes public/professional evidence from private repo planning context.
The skill distinguishes public/professional evidence from private repo planning context.
The skill explains that domain packs are not globally installed and routes users to `pack` for project-local access.
The skill has a public-first evidence path and optional owner-analytics path.
The skill requires evidence coverage and forbids invented links, sponsors, disclosures, chapters, transcript details, comments, and owner-only metrics.
The skill requires source paths, capture dates, confidence levels, and evidence gaps.
The skill requires source paths, capture dates, confidence levels, and evidence gaps.
The skill supports `audit`, `draft`, and `template` modes with explicit output paths.
The three skills require persisted evidence, explicit evidence coverage, anti-fabrication constraints, output paths, archive-first replacement, and next-skill routing.
The tier1 benchmark setup, hard route assertion, and quality scoring path are inspected.
The tier1 fixture prompt, route assertions, and quality scoring expectations are checked.
The triage report names the responsible contract gap, exact recommended wording, validation checks, and next skill route.
The triage report records verdict, root cause, responsible gap, validation plan, and next route.
The workflow can scope evidence to the current repo/session directory before broad history scanning.
The workflow distinguishes user-identified mistakes from agent-verified mistakes.
The workflow explicitly avoids default broad `$analyze-sessions` behavior.
Tier 1 skills have custom Codex benchmark setups.
Tier 1 skills have custom Codex benchmark setups.
Tier 1 workflow skills have quality rubrics and evaluator coverage.
Tier 1 workflow skills have quality rubrics and evaluator coverage.
Tier 2 and Tier 3 skills have custom Codex benchmark setups or explicit blocked statuses.
Tier 2 and Tier 3 skills have custom Codex benchmark setups or explicit blocked statuses.
Tier 2/Tier 3 global skills have quality rubrics where deterministic signals are practical, or explicit blocked/deferred quality notes.
Tier 2/Tier 3 global skills have quality rubrics where deterministic signals are practical, or explicit blocked/deferred quality notes.
Tighten Playful Lab mobile CSS for chip navigation, body stacking, step card, demo panel, notebook, and controls.
Tighten route assertions/quality scoring so approval-route mentions elsewhere cannot mask a final `npm run build` or `npx next build` handoff.
Timeout errors report lock owner metadata.
Triage report records verdict, root cause, responsible gap, validation plan, and next route.
Unsupported targets such as `run` fail before any agent benchmark work with a clear unsupported-target message.
Update `benchmark/review-analyze-sessions-2026-05-15.md` with scores, findings, remediation, and next route.
Update `benchmark/review-analyze-sessions-2026-05-15.md` with scores, findings, remediation, and next route.
Update `benchmark/triage-session-triage-2026-05-13.md` with verdict, root cause, responsible gap, validation plan, and next route.
Update `pack-fixture-evidence` to require concrete fixture paths and fixture input facts.
Update `tests/layer4/setups/tier1-workflows.setup.ts` so the fixture prompt requires separate `## Benchmark Metrics` rows for pass rate, p50 latency, total cost, and raw session path with the exact evidence tokens.
Update affected skill contracts without weakening explicit write/update modes.
Update and validate `benchmark/test-analyze-sessions-2026-05-15.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Update and validate `benchmark/test-analyze-sessions-2026-05-15.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Update and validate `benchmark/test-analyze-sessions-2026-05-15.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Update benchmark coverage and pack workflow fixtures.
Update benchmark coverage metadata and record a lesson.
Update benchmark setup and layer1 coverage to protect the remediation-ready handoff.
Update both `global/claude/design-system/SKILL.md` and `global/codex/design-system/SKILL.md` with the same Markdown-heading requirement.
Update Codex skill contracts that still instruct grouped interview questions or Claude-only `AskUserQuestion` behavior.
Update discovery docs.
Update existing creator-media skill contracts instead of creating a duplicate meta-skill.
Update mirrored Claude/Codex `analyze-sessions` contracts for one runner-native command, concrete gap phrase, owner surface, validation expectation, and explicit-vs-inferred attribution.
Update pack aliases, `business-app` expansion, docs, and routing guidance.
Update the `analyze-sessions` benchmark fixture to provide broad repeated-history evidence and runner-specific `targeted-skill-builder` routes.
Update the `analyze-sessions` layer4 quality patterns to accept valid owner-surface and validation-expectation report structures.
Update the `icon-handler` benchmark prompt to distinguish verification commands from the required final next route.
Update the `icon-handler` fixture to expect `/icon-handler` for Claude and `$icon-handler` for Codex.
Update the existing pack benchmark setup rather than changing the mirrored skill contracts or creating a new skill.
Update the fixture prompt to require reading `session-log.md` and `tasks/lessons.md`, writing `session-triage-report.md` in the project root before optional exploration, and preserving the no-skill-change branch for one-off noncompliance with an adequate validation rule.
Update the output-quality rubric to require benchmark metrics inside a Markdown table under `## Benchmark Metrics`.
Update the pack benchmark setup so `content-programming` asks for pillars, formats, cadence constraints, portfolio balance, measurement, cleanup/refactor, and next series candidates.
Update the route helper so accepted bold next-route labels pass.
Update the route helper, `analyze-sessions` benchmark prompt, and quality/hard assertion path for exact final-route matching.
Update the tier1 benchmark fixture prompt, hard assertions, and output-quality rubric to require stable report sections/tables for verification, benchmark metrics, raw evidence, and next route.
`update-card` flags `--progress`, `--description`, and `--due` each have at least one test.
Use `benchmark/review-analyze-sessions-2026-05-15.md` and relevant lessons as the scoped evidence source.
Use the current `$creator-positioning` correction and `$session-triage` result as evidence.
Use the fresh benchmark report, raw Claude run artifacts, session-triage finding, and relevant lessons as scoped evidence.
Use the fresh benchmark-agent-review report as scoped evidence and avoid broad history scanning.
Use the triage report and relevant lessons as the scoped evidence source.
User-correction handling requires updating `tasks/lessons.md` and, when applicable, the relevant skill or validation check.
User-correction handling requires updating `tasks/lessons.md` and, when applicable, the relevant skill or validation check.
Validate focused layer1 setup/quality tests, required skill checks, benchmark coverage, target verify, Claude smoke benchmark, and whitespace.
Validate report fields, record results, then commit and push on `master`.
Validate report fields, record results, then commit and push on `master`.
Validate report fields, then commit and push intended changes on `master`.
Validate report fields, then commit and push intended changes.
Validate report fields, then commit and push intended changes.
Validate report fields, then commit and push intended changes.
Validate report fields, then commit and push intended changes.
Validate report fields, then commit and push intended changes.
Validate report fields, update docs, then commit and push intended changes on `master`.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the report contains required benchmark fields.
Validate the review report contains required fields.
Validate the user's old-version hypothesis against current route code.
Validate whether `/workflows` still renders a legacy workflow block above the Playful Lab demo.
Validate, document the correction lesson, then commit and push on `master`.
Validate, record review notes, then commit and push on `master`.
Validate:
Validate:
Validation fails when a `blocked` row lacks a reason and next command.
Validation fails when a `blocked` row lacks a reason and next command.
Validation fails when a `custom` coverage row points to a missing setup.
Validation fails when a `custom` coverage row points to a missing setup.
Validation fails when a repository skill is missing from the coverage matrix.
Validation fails when a repository skill is missing from the coverage matrix.
Validation passes with dependency/version/routing audits, targeted text scans, and `git diff --check`.
Validation passes with dependency/version/routing audits, targeted text scans, install refresh, and `git diff --check`.
Validation passes with dependency/version/routing audits, targeted text scans, install refresh, and `git diff --check`.
Validation passes with install dry checks, skill dependency/version/routing audits, targeted text scans, and `git diff --check`.
Validation passes with install, skill dependency/version/routing checks, layer1 tests, targeted scans, and `git diff --check`.
Validation passes with layer1 tests, skipped live-test dry run, live Claude/Codex runs, and skill contract scripts.
Validation passes with mirrored contract checks, version checks, targeted text scans, and `git diff --check`.
Validation passes with mirrored contract scans, dependency/version/routing checks, layer1 tests, and `git diff --check`.
Validation passes with mirrored-contract scans, docs/routing scans, dependency/version/routing audits, and `git diff --check`.
Validation passes with skill dependency/version checks and targeted reference scans.
Validation passes with skill dependency/version checks and targeted reference scans.
Validation passes with skill dependency/version checks and targeted reference scans.
Validation passes with skill dependency/version checks and targeted reference scans.
Validation passes with skill dependency/version checks and targeted routing scans.
Validation passes with skill dependency/version checks, next-step routing audit, targeted mirrored-contract scans, and `git diff --check`.
Validation passes with targeted checks for LinkedIn baseline, privacy constraints, and no paid/API-first language.
Validation passes with targeted checks for LinkedIn baseline, privacy constraints, and no paid/API-first language.
Validation passes with targeted contract scans, script fixture checks, skill dependency/version/routing audits, and `git diff --check`.
Validation passes with targeted contract scans, script fixture checks, skill dependency/version/routing audits, and `git diff --check`.
Validation passes with targeted scans, skill metadata/routing checks, tests, and whitespace checks.
Validation passes with targeted scans, skill metadata/routing checks, tests, and whitespace checks.
Vercel static deployment instructions and manual launch tasks are current.
Verification confirms asset formats, production build icon route output, and generated HTML references.
Verification passes, results are recorded in `tasks/todo.md`, and changes are committed and pushed on `master`.
Verify generated asset formats, production build output icon routes, and generated HTML references.
Verify Markdown/frontmatter structure, contrast findings, and diff scope before shipping.
Verify skill integrity, targeted tests, generated data freshness if needed, and whitespace.
Verify the report against source evidence and record results here before shipping.
Write `benchmark/review-analyze-sessions-2026-05-15.md` with scores, findings, remediation, and next route.
Write `benchmark/review-content-programming-2026-05-14.md` with scores, findings, remediation, and next route.
Write `benchmark/review-content-programming-2026-05-14.md` with scores, findings, remediation, and next route.
Write `benchmark/review-icon-handler-2026-05-14.md` with scores, findings, remediation, and next route.
Write `benchmark/review-session-triage-2026-05-13.md` with score table, findings, remediation handoff, and recommended next command.
Write `benchmark/test-content-programming-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-icon-handler-2026-05-13.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-icon-handler-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-icon-handler-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-icon-handler-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write `benchmark/triage-analyze-sessions-2026-05-15.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-benchmark-test-skill-2026-05-13.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-content-programming-2026-05-14-quality.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-content-programming-2026-05-14.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-icon-handler-2026-05-13.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-icon-handler-2026-05-14-image.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-icon-handler-2026-05-14.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `benchmark/triage-session-triage-2026-05-13.md` with verdict, root cause, responsible gap, validation plan, and next route.
Write `DESIGN.md` in the Google Labs Stitch-style format with machine-readable YAML frontmatter and prose guardrails.
Write a report with counts, examples, limitations, and a recommendation: keep, reform, or replace the workflow.
Write and validate `benchmark/review-benchmark-test-skill-2026-05-13.md` with scores, findings, remediation, and next route.
Write and validate `benchmark/review-session-triage-2026-05-13.md` with scores, findings, remediation, and next route.
Write and validate `benchmark/review-session-triage-2026-05-13.md` with scores, findings, remediation, and next route.
Write and validate `benchmark/review-session-triage-2026-05-13.md` with scores, findings, remediation, and next route. ✓ Required report fields present.
Write and validate `benchmark/test-analyze-sessions-2026-05-15.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write and validate `benchmark/test-analyze-sessions-2026-05-15.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write and validate `benchmark/test-benchmark-test-skill-2026-05-12.md` with verify, benchmark, latency, cost, consistency, and raw session evidence.
Write and validate `benchmark/test-benchmark-test-skill-2026-05-12.md` with verify, benchmark, latency, cost, consistency, and raw session evidence.
Write and validate `benchmark/test-benchmark-test-skill-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence.
Write and validate `benchmark/test-benchmark-test-skill-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence.
Write and validate `benchmark/test-content-programming-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write and validate `benchmark/test-content-programming-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write and validate `benchmark/test-content-programming-2026-05-14.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
Write and validate `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence. ✓ Report updated with current 11:36 ET run data.
Write and validate `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence. ✓ Report updated with fresh 10:40 run data.
Write and validate `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence. ✓ Report updated with fresh run data and prior-run comparison.
Write and validate `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, and raw session evidence. ✓ Report updated with fresh run data.
Write and validate `benchmark/test-session-triage-2026-05-13.md` with verify, benchmark, latency, cost, consistency, raw session paths, and next route.
Write and validate `benchmark/triage-session-triage-2026-05-13.md` with verdict, root cause, responsible gap, validation plan, and next route.
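The three coverage-validation items in the Done list above (a repository skill missing from the matrix, a `custom` row pointing to a missing setup, and a `blocked` row lacking a reason and next command) can be read as one validation pass. The following is a minimal sketch under stated assumptions, not the repository's actual validator: the `CoverageRow` shape, its field names, and the `validateCoverage` function are hypothetical, and it assumes a `blocked` row needs both a reason and a next command.

```typescript
// Hypothetical shapes: field names are illustrative, not taken from the repository.
type CoverageStatus = "custom" | "generic" | "blocked";

interface CoverageRow {
  skill: string;
  status: CoverageStatus;
  setupPath?: string;   // expected on "custom" rows
  reason?: string;      // expected on "blocked" rows
  nextCommand?: string; // expected on "blocked" rows
}

// Returns one message per violated rule; an empty array means validation passes.
function validateCoverage(
  rows: CoverageRow[],
  repositorySkills: string[],
  existingSetups: Set<string>,
): string[] {
  const errors: string[] = [];
  const covered = new Set(rows.map((row) => row.skill));

  // Rule 1: every repository skill must appear in the coverage matrix.
  for (const skill of repositorySkills) {
    if (!covered.has(skill)) {
      errors.push(`${skill}: missing from coverage matrix`);
    }
  }

  for (const row of rows) {
    // Rule 2: a "custom" row must point to a setup that exists.
    if (row.status === "custom" && (!row.setupPath || !existingSetups.has(row.setupPath))) {
      errors.push(`${row.skill}: custom row points to a missing setup`);
    }
    // Rule 3 (assumed reading): a "blocked" row needs both a reason and a next command.
    if (row.status === "blocked" && (!row.reason || !row.nextCommand)) {
      errors.push(`${row.skill}: blocked row lacks a reason or next command`);
    }
  }
  return errors;
}
```

Collecting every violation instead of stopping at the first keeps a single run useful when several rows are wrong at once.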
Planned 37
`/admin/newsletter` requires the configured admin secret.
`/feature-interview` - Triage 8 remaining unspecced ideas in `tasks/ideas.md` (cleaned from 25 on 2026-05-15; 17 removed as shipped/obsolete).
`/follow` submits valid email addresses through a first-party tRPC mutation.
`commit-and-push-by-feature` has a safe fixture plan using an explicit-permission disposable GitHub test repository.
`commit-and-push-by-feature` has a safe fixture plan using an explicit-permission disposable GitHub test repository.
A clean benchmark-results matrix lists skills with persisted evaluated benchmark data, hard pass rates, quality scores, subjective review grades when present, and raw report paths.
A clean benchmark-results matrix lists skills with persisted evaluated benchmark data, hard pass rates, quality scores, subjective review grades when present, and raw report paths.
Admin can list, search, copy active emails, and download CSV.
All phase tests pass.
Cleanup and infrastructure-block handling are documented for the disposable repository workflow.
Cleanup and infrastructure-block handling are documented for the disposable repository workflow.
Commit and push intended changes on `master`.
Commit and push intended changes on `master`.
Demo payloads include the exact benchmark prompt source and representative output excerpt/path without exposing unrelated temp paths or excessive transcript noise.
Duplicate signup behavior is idempotent.
If verify passes, run `pnpm bench --skill ship --agent both --runs 3 --chunk-size 3 --pause 0`.
Invalid emails and database failures produce appropriate public UI states without leaking internals.
Layer1 or generator tests cover the new data shape and fallback behavior when raw run artifacts are absent.
Local app validation, database-contract checks, admin access checks, and whitespace checks pass.
Neon stores subscriber records with `email`, `status`, `source_page`, `consent_text_version`, `created_at`, and `updated_at`.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No GitHub Actions are created, modified, or recommended.
No regressions in previous phase tests.
Record results here, then commit and push intended changes on `master`.
Refresh generated Skills Showcase data if curated benchmark evidence changes.
Results are recorded in `tasks/todo.md`, then committed and pushed on `master`.
Run `pnpm verify --skill ship`; stop before bench if verification fails.
Showcase data is regenerated and validated.
Skills Showcase exposes benchmark results or links to the generated matrix without confusing coverage status with completed graded runs.
Skills Showcase exposes benchmark results or links to the generated matrix without confusing coverage status with completed graded runs.
Subscriber data is never exposed in generated public assets or committed files.
The benchmark coverage registry reflects any newly unblocked setup status only after the safe fixture is implemented and validated.
The benchmark coverage registry reflects any newly unblocked setup status only after the safe fixture is implemented and validated.
The brand site renders benchmark-backed prompt/output demos for relevant pack/skill pages while preserving existing summary metrics.
The showcase data generator derives a compact demo payload from benchmark run artifacts when a skill has persisted benchmark evidence.
Write and validate `benchmark/test-ship-2026-05-16.md` with verify, benchmark, latency, cost, consistency, raw paths, and recommended next route.
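The newsletter-capture items in the Planned list above name the subscriber fields (`email`, `status`, `source_page`, `consent_text_version`, `created_at`, `updated_at`) and require idempotent duplicate signups. A minimal sketch of that behavior follows, assuming an in-memory `Map` as a stand-in for the Neon table. The `signup` function, the `SignupResult` type, the loose email pattern, and the lowercase normalization are all hypothetical. The `"active"` status value is assumed from the "copy active emails" item.

```typescript
// Field names follow the roadmap item; everything else here is illustrative.
interface SubscriberRecord {
  email: string;
  status: string; // "active" is assumed; the roadmap does not list allowed values
  source_page: string;
  consent_text_version: string;
  created_at: Date;
  updated_at: Date;
}

type SignupResult =
  | { ok: true; created: boolean }
  | { ok: false; error: "invalid_email" };

// Deliberately loose pattern for illustration only.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function signup(
  store: Map<string, SubscriberRecord>,
  rawEmail: string,
  sourcePage: string,
  consentTextVersion: string,
  now: Date = new Date(),
): SignupResult {
  // Normalize so " A@Example.com " and "a@example.com" map to the same record.
  const email = rawEmail.trim().toLowerCase();
  if (!EMAIL_PATTERN.test(email)) {
    return { ok: false, error: "invalid_email" };
  }
  // Idempotent duplicate: report success and leave the existing record untouched.
  if (store.has(email)) {
    return { ok: true, created: false };
  }
  store.set(email, {
    email,
    status: "active",
    source_page: sourcePage,
    consent_text_version: consentTextVersion,
    created_at: now,
    updated_at: now,
  });
  return { ok: true, created: true };
}
```

Returning a typed result rather than throwing keeps database or validation details out of the public UI state, in line with the item about not leaking internals.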
Timeline
20 events
May 2026
fix
fix(youtube-title-thumbnail-audit): add YouTube Test and Compare awareness
May 16 | static-
docs
docs: clean up stale ideas — 17 shipped/obsolete, 8 remain
May 15 | static-
docs
docs: reconcile orphaned Phase 38 manual tasks as deferred
May 15 | static-
docs
docs: update priority task queue with orphaned manual tasks and unspecced ideas
May 15 | static-
docs
docs: record workflows mobile ship
May 15 | static-
fix
fix: improve workflows mobile lab layout
May 15 | static-
docs
docs: record benchmark-workflow integration session
May 15 | static-
feature
feat: connect workflow demos to benchmark evidence
May 15 | static-
docs
docs: mark skills showcase vercel hosting live
May 15 | static-
docs
docs: review analyze-sessions benchmark outputs
May 15 | static-
docs
docs: refresh analyze-sessions benchmark evidence
May 15 | static-
test
test: relax analyze-sessions quality rubric
May 15 | static-
chore
chore: refresh showcase proof metadata
May 15 | static-
chore
chore: refresh analyze-sessions benchmark evidence
May 15 | static-
test
test: require exact analyze-sessions benchmark route
May 15 | static-
docs
docs: review analyze-sessions benchmark outputs
May 15 | static-
test
test: refresh analyze-sessions benchmark evidence
May 15 | static-
docs
docs: record analyze-sessions handoff update
May 15 | static-
fix
fix: tighten analyze-sessions targeted handoff
May 15 | static-
test
test: review analyze-sessions benchmark outputs
May 15 | static-
Dev Docs
44 files
Specs
- Benchmark Custom Coverage | May 15, 2026 | 9.3 KB
- Benchmark Custom Coverage Feature Interview | May 15, 2026 | 6.9 KB
- Code Quality Skill Pack Report | May 15, 2026 | 9.1 KB
- Creator Platform Evidence Schema | May 15, 2026 | 12.6 KB
- Creator Platform Evidence Schema Interview | May 15, 2026 | 6.6 KB
- Final UI Specification: G Skillpacks — Playful Blueprint Theme | May 15, 2026 | 5.3 KB
- First-Party G Skillpacks Newsletter Capture | May 15, 2026 | 12.8 KB
- First-Party Skills Showcase Newsletter Capture - Interview Log | May 15, 2026 | 8.8 KB
- G Skillpacks Website | May 15, 2026 | 24.6 KB
- Interview Log: Add --board flag to kanban search | May 15, 2026 | 2.4 KB
- Interview Log: Kanban Command Test Coverage | May 15, 2026 | 2.4 KB
- Interview Log: Kanban Production Hardening | May 15, 2026 | 7.1 KB
- Interview Log: Multi-User Kanban Concurrency Support | May 15, 2026 | 5.2 KB
- Monorepo Execution Controller | May 15, 2026 | 18.2 KB
- Monorepo Execution Controller — Interview Log | May 15, 2026 | 6.1 KB
- Poketo Headless Auth Migration Brief | May 15, 2026 | 13.7 KB
- Project Fleet Specification | May 15, 2026 | 6.6 KB
- Skills Showcase Website - Interview Log | May 15, 2026 | 12.4 KB
- Spec Drift Report | May 15, 2026 | 23.9 KB
- UI Interview Log: Skills Showcase Website | May 15, 2026 | 14.4 KB
- UI Spec: G Skillpacks Website | May 15, 2026 | 31.5 KB
Docs
- Benchmark Results Matrix | May 15, 2026 | 8.7 KB
- Canonical Agentic Workflow Report | May 15, 2026 | 18.1 KB
- Codex Workflow | May 15, 2026 | 11.9 KB
- Kanban Skill Test Results | May 15, 2026 | 6.5 KB
- Kanban Skill Validation (Complete) | phases | May 15, 2026 | 0.9 KB
- Operating Modes | May 15, 2026 | 24.7 KB
- Pack Workflow Matrix | May 15, 2026 | 5.3 KB
- Phase 1: Kanban Skill Suite (Completed) | phases | May 15, 2026 | 1.5 KB
- Phase 2: Proactive Board Intelligence (Completed) | phases | May 15, 2026 | 1.0 KB
- Phase 3: Board Templates | phases | May 15, 2026 | 1.8 KB
- Phase 4: Archive Automation | phases | May 15, 2026 | 3.1 KB
- Phase 5: Expert Review Fixes | phases | May 15, 2026 | 1.3 KB
- Phase 6: Testing Hardening I | phases | May 15, 2026 | 1.4 KB
- Phase 7: Testing Hardening II | phases | May 15, 2026 | 1.3 KB
- Phase 8: Kanban DX | phases | May 15, 2026 | 1.2 KB
- Phase 9: Skill Infrastructure | phases | May 15, 2026 | 3.4 KB
- Project-Local Skill Packs | May 15, 2026 | 10.3 KB
- Quality Gate Contract | May 15, 2026 | 7.4 KB
- Safe Disposable GitHub Test-Repository Fixtures | May 15, 2026 | 4.0 KB
- Skill Next-Step Contracts | May 15, 2026 | 12.8 KB
- Skill Versioning | May 15, 2026 | 1.3 KB
- Skills Reference | May 15, 2026 | 17.6 KB
- Test Harness | May 15, 2026 | 2.2 KB