feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip) #601
Merged
…g-hdqu0 #compile-strip)

Adds 'ao eval outcomes compile <input.json>' under evalCmd: projects a locked Task + criteria into an Outcomes rubric via the merged evalsubstrate.ProjectRubric, then re-scans (ContainsAny, guard layer 3) and REFUSES to emit any rubric that would carry a holdout answer across the cloud boundary (Managed Agents are not ZDR). Outcomes is a derived projection of SCHEMA.md, never an alternate authority.

TDD: TestCompileOutcomesRubric_StripsHoldoutTarget (criteria carried, no leak), TestCompileOutcomesRubric_RefusesLeak (deny-by-default on a leaking criterion). go test/vet/build green; COMMANDS.md regenerated for the new subcommand.

Closes-scenario: ag-hdqu0.1#compile-strip
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_test.go
…canary (ag-hdqu0.1)

The CI contracts-sync contract canary agentops-core.cli-command-surface-matrix failed because the new command was an uncovered leaf in check-cmdao-surface-parity. Added a public-stateful-fixture-needed allowlist entry (core logic is unit-tested in eval_outcomes_test.go; a CLI smoke test needs an input.json fixture, tracked as follow-up ag-lkxx). The parity check now reports the command as 'allowlisted'.

…mes command (ag-hdqu0.1)

The documented-cli-help-matrix canary case hard-codes command heading counts. 'ao eval outcomes' (a #### sub heading) bumps sub 175->176 and all 245->246, and the matrix size 245->246. 'ao eval outcomes compile' is a ##### heading, which is not counted in top/sub/all. The fixture now passes (cli-help-matrix-ok). Together with the surface-parity allowlist entry, this clears the full cli-command-surface-matrix canary. There is no regen path for these counts; tracked in ag-lkxx.

…ag-hdqu0.1)

The documented-cli-help-matrix case asserts stdout_contains 'cli-command-headings: top=70 sub=175 all=245'; the fixture now prints 176/246 after adding the outcomes command. Updated the expected string to match. This is the third and final place the counts appear, after the fixture's assert and printf. The canary aggregate had already risen 0.7917->0.9306; this clears the last failing case.
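The heading arithmetic in these commits can be illustrated with a small sketch. It is not the canary's actual fixture: `countHeadings` is a hypothetical helper, the sample document is made up, and it assumes that `### ` marks a top-level command while `#### ` marks a subcommand (the commits only confirm that `####` counts as sub and `#####` is not counted).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countHeadings tallies command headings by level.
// Assumption: "### " is a top-level command and "#### " is a subcommand.
// Deeper headings such as "##### " are skipped, matching the commit's note
// that leaf commands do not change the top/sub/all counts.
func countHeadings(doc string) (top, sub int) {
	sc := bufio.NewScanner(strings.NewReader(doc))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.HasPrefix(line, "##### "):
			// leaf command: not counted
		case strings.HasPrefix(line, "#### "):
			sub++
		case strings.HasPrefix(line, "### "):
			top++
		}
	}
	return top, sub
}

func main() {
	// Made-up excerpt in the assumed COMMANDS.md heading style.
	doc := "### ao eval\n#### ao eval outcomes\n##### ao eval outcomes compile\n"
	top, sub := countHeadings(doc)
	fmt.Printf("cli-command-headings: top=%d sub=%d all=%d\n", top, sub, top+sub)
}
```

Under this reading, adding a `####` heading raises both `sub` and `all` by one, while adding a `#####` leaf leaves every count unchanged.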
boshu2 added a commit that referenced this pull request May 29, 2026
…cord (ag-hdqu0 #ingest-verdict) (#603)

## What

Adds `ao eval outcomes ingest <score.json>`, the third slice of ag-hdqu0. Maps an Outcomes grader score (aggregate + per-criterion) onto the **one** council verdict record (`skills/council/schemas/verdict.json`: PASS/WARN/FAIL + `satisfaction_score` + `satisfaction_breakdown`), closing the **Outcomes → Knowledge Flywheel** loop without forking the verdict format. Pairs with `ao eval outcomes compile` (#601).

Verdict bands: PASS ≥ threshold · FAIL < 70% of threshold · WARN between.

## One-shot surface landing (compounding from #601/ag-lkxx)

`ingest` is a `#####` leaf, so heading counts are unchanged; only an allowlist entry was needed. Ran `scripts/test-agentops-contract-canaries.sh` locally **before pushing**: `cli-command-surface-matrix verdict=pass aggregate=1`. No CI canary round-trips.

## Tests (TDD)

- `TestIngestOutcomesScore_ProducesVerdictRecord`: PASS verdict, satisfaction_score=aggregate, breakdown=criterion scores, schema_version 4, non-nil findings.
- `TestIngestOutcomesScore_VerdictBands`: PASS/WARN/FAIL banding.
- `go test ./cmd/ao` green, vet/build clean.

Closes-scenario: ag-hdqu0.2#ingest-verdict
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_ingest_test.go
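The verdict bands stated in the #603 commit message (PASS ≥ threshold, FAIL < 70% of threshold, WARN between) can be sketched as a single function. This is an illustration of the banding rule only: `verdictFor`, the plain string return type, and the 0.8 threshold in `main` are assumptions, not the code in `eval_outcomes_ingest_test.go` or its command.

```go
package main

import "fmt"

// verdictFor maps an aggregate score onto a verdict band.
// Hypothetical helper: the real ingest code may use different names and types.
//   PASS: aggregate >= threshold
//   FAIL: aggregate <  0.7 * threshold
//   WARN: everything in between
func verdictFor(aggregate, threshold float64) string {
	switch {
	case aggregate >= threshold:
		return "PASS"
	case aggregate < 0.7*threshold:
		return "FAIL"
	default:
		return "WARN"
	}
}

func main() {
	const threshold = 0.8 // assumed value, for illustration only
	for _, score := range []float64{0.9, 0.7, 0.5} {
		fmt.Println(score, verdictFor(score, threshold))
	}
}
```

Checking the PASS case first keeps the comparison at the threshold itself inclusive, as the `≥` in the commit message requires.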
What

Adds `ao eval outcomes compile <input.json>` under `eval`, the second slice of ag-hdqu0, building directly on the merged `evalsubstrate.ProjectRubric` (#599). Projects a locked Task + criteria into an Outcomes rubric payload.

Holdout isolation (Managed Agents are NOT ZDR)

- `ProjectRubric` (strips ground truth by construction).
- Re-scan (`ContainsAny`, guard layer 3): `compileOutcomesRubric` REFUSES to emit any rubric that would carry a holdout value (input `holdout_values` feeds the scan; never copied to output).
- `judge_content_hash` for stale-rubric self-invalidation.

Outcomes is a derived projection of the locked SCHEMA.md, never an alternate authority.
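The deny-by-default re-scan can be pictured with a minimal sketch. Everything here is assumed for illustration: the `Rubric` type, `scanForLeak`, `ErrHoldoutLeak`, and plain substring matching are stand-ins, and the actual `ContainsAny` and `compileOutcomesRubric` in the PR may behave differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Rubric is a hypothetical stand-in for the projected Outcomes payload.
type Rubric struct {
	Criteria []string
}

// ErrHoldoutLeak marks a refusal caused by a holdout value in the rubric.
var ErrHoldoutLeak = errors.New("holdout leak")

// scanForLeak refuses any rubric whose criteria contain a holdout value.
// Assumption: a case-sensitive substring match is enough to show the idea.
// The holdout values only feed the scan and are never copied into the rubric.
func scanForLeak(r Rubric, holdoutValues []string) error {
	for i, criterion := range r.Criteria {
		for _, h := range holdoutValues {
			if h == "" {
				continue // an empty value would match every criterion
			}
			if strings.Contains(criterion, h) {
				// Name the leaking value so the author can fix the criterion.
				return fmt.Errorf("%w: criterion %d contains %q", ErrHoldoutLeak, i, h)
			}
		}
	}
	return nil
}

func main() {
	holdout := []string{"Paris"}

	clean := Rubric{Criteria: []string{"Names the capital city correctly"}}
	fmt.Println("clean rubric error:", scanForLeak(clean, holdout))

	leaky := Rubric{Criteria: []string{"Answer must be Paris"}}
	fmt.Println("leaky rubric error:", scanForLeak(leaky, holdout))
}
```

Returning an error instead of silently redacting the value mirrors the refusal described above: a leaking rubric is never emitted, and the message names the value that triggered the refusal.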
Tests (TDD)

- `TestCompileOutcomesRubric_StripsHoldoutTarget`: criteria carried, zero holdout leak.
- `TestCompileOutcomesRubric_RefusesLeak`: deny-by-default errors on a leaking criterion, names the value.
- `cmd/ao` tests pass, `go vet`/`build` clean, `COMMANDS.md` regenerated for the new subcommand.

Next in ag-hdqu0

`.2` ingest→verdict; `.4` re-parse/stale-hash; `.9` gate#3 burn ledger (remaining half); the future GT-loader slice wires `holdout_values` from the substrate automatically.

Closes-scenario: ag-hdqu0.1#compile-strip
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_test.go