feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip) by boshu2 · Pull Request #601 · boshu2/agentops

boshu2 · 2026-05-29T18:50:07Z

What

Adds ao eval outcomes compile <input.json> under eval — the second slice of ag-hdqu0, building directly on the merged evalsubstrate.ProjectRubric (#599). Projects a locked Task + criteria into an Outcomes rubric payload.

Holdout isolation (Managed Agents are NOT ZDR)

Projection via ProjectRubric (strips ground truth by construction).
Deny-by-default re-scan (ContainsAny, guard layer 3): compileOutcomesRubric REFUSES to emit any rubric that would carry a holdout value (input holdout_values feeds the scan; never copied to output).
Carries judge_content_hash for stale-rubric self-invalidation.

Outcomes is a derived projection of the locked SCHEMA.md — never an alternate authority.

Tests (TDD)

TestCompileOutcomesRubric_StripsHoldoutTarget — criteria carried, zero holdout leak.
TestCompileOutcomesRubric_RefusesLeak — deny-by-default errors on a leaking criterion, names the value.
9184 cmd/ao tests pass, go vet/build clean, COMMANDS.md regenerated for the new subcommand.

Next in ag-hdqu0

.2 ingest→verdict; .4 re-parse/stale-hash; .9 gate#3 burn ledger (remaining half); the future GT-loader slice wires holdout_values from the substrate automatically.

Closes-scenario: ag-hdqu0.1#compile-strip
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_test.go

…g-hdqu0 #compile-strip) Adds 'ao eval outcomes compile <input.json>' under evalCmd: projects a locked Task + criteria into an Outcomes rubric via the merged evalsubstrate.ProjectRubric, then re-scans (ContainsAny, guard layer 3) and REFUSES to emit any rubric that would carry a holdout answer across the cloud boundary (Managed Agents are not ZDR). Outcomes is a derived projection of SCHEMA.md, never an alternate authority. TDD: TestCompileOutcomesRubric_StripsHoldoutTarget (criteria carried, no leak), TestCompileOutcomesRubric_RefusesLeak (deny-by-default on a leaking criterion). go test/vet/build green; COMMANDS.md regenerated for the new subcommand. Closes-scenario: ag-hdqu0.1#compile-strip Bounded-context: BC1-Corpus Evidence: cli/cmd/ao/eval_outcomes_test.go

…canary (ag-hdqu0.1) CI contracts-sync contract-canary agentops-core.cli-command-surface-matrix failed: the new command was an uncovered leaf in check-cmdao-surface-parity. Added a public-stateful-fixture-needed allowlist entry (core logic unit-tested in eval_outcomes_test.go; CLI smoke needs an input.json fixture — follow-up ag-lkxx). Parity check now reports the command 'allowlisted'.

…mes command (ag-hdqu0.1) The documented-cli-help-matrix canary case hard-codes command heading counts; 'ao eval outcomes' (#### sub) bumps sub 175->176 and all 245->246, and the matrix size 245->246. 'ao eval outcomes compile' is ##### (not counted in top/sub/all). Fixture now passes (cli-help-matrix-ok). Pairs with the surface-parity allowlist entry to clear the full cli-command-surface-matrix canary. No regen path for these counts — tracked in ag-lkxx.

…ag-hdqu0.1) The documented-cli-help-matrix case asserts stdout_contains 'cli-command-headings: top=70 sub=175 all=245'; the fixture now prints 176/246 after adding the outcomes command. Updated the expected string to match (third + final counts location after the fixture's assert + printf). Canary aggregate had already risen 0.7917->0.9306; this clears the last failing case.

…cord (ag-hdqu0 #ingest-verdict) (#603) ## What Adds `ao eval outcomes ingest <score.json>` — the third slice of ag-hdqu0. Maps an Outcomes grader score (aggregate + per-criterion) onto the **one** council verdict record (`skills/council/schemas/verdict.json`: PASS/WARN/FAIL + `satisfaction_score` + `satisfaction_breakdown`), closing the **Outcomes → Knowledge Flywheel** loop without forking the verdict format. Pairs with `ao eval outcomes compile` (#601). Verdict bands: PASS ≥ threshold · FAIL < 70% of threshold · WARN between. ## One-shot surface landing (compounding from #601/ag-lkxx) `ingest` is a `#####` leaf, so heading counts are unchanged — only an allowlist entry was needed. Ran `scripts/test-agentops-contract-canaries.sh` locally **before pushing**: `cli-command-surface-matrix verdict=pass aggregate=1`. No CI canary round-trips. ## Tests (TDD) - `TestIngestOutcomesScore_ProducesVerdictRecord` — PASS verdict, satisfaction_score=aggregate, breakdown=criterion scores, schema_version 4, non-nil findings. - `TestIngestOutcomesScore_VerdictBands` — PASS/WARN/FAIL banding. - `go test ./cmd/ao` green, vet/build clean. Closes-scenario: ag-hdqu0.2#ingest-verdict Bounded-context: BC1-Corpus Evidence: cli/cmd/ao/eval_outcomes_ingest_test.go

github-actions Bot added the cli label May 29, 2026

boshu2 added 3 commits May 29, 2026 15:12

boshu2 merged commit a3900ad into main May 29, 2026
14 checks passed

boshu2 deleted the feat/ag-hdqu0.1-outcomes-compile branch May 29, 2026 19:49

boshu2 mentioned this pull request May 29, 2026

feat(eval): ao eval outcomes ingest — Outcomes score → one verdict record (ag-hdqu0 #ingest-verdict) #603

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip)#601

feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip)#601
boshu2 merged 4 commits into
mainfrom
feat/ag-hdqu0.1-outcomes-compile

boshu2 commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

boshu2 commented May 29, 2026

What

Holdout isolation (Managed Agents are NOT ZDR)

Tests (TDD)

Next in ag-hdqu0

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant