Skip to content

feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip)#601

Merged
boshu2 merged 4 commits into
mainfrom
feat/ag-hdqu0.1-outcomes-compile
May 29, 2026
Merged

feat(eval): ao eval outcomes compile — holdout-safe rubric payload (ag-hdqu0 #compile-strip)#601
boshu2 merged 4 commits into
mainfrom
feat/ag-hdqu0.1-outcomes-compile

Conversation

@boshu2
Copy link
Copy Markdown
Owner

@boshu2 boshu2 commented May 29, 2026

What

Adds ao eval outcomes compile <input.json> under eval — the second slice of ag-hdqu0, building directly on the merged evalsubstrate.ProjectRubric (#599). Projects a locked Task + criteria into an Outcomes rubric payload.

Holdout isolation (Managed Agents are NOT ZDR)

  • Projection via ProjectRubric (strips ground truth by construction).
  • Deny-by-default re-scan (ContainsAny, guard layer 3): compileOutcomesRubric REFUSES to emit any rubric that would carry a holdout value (input holdout_values feeds the scan; never copied to output).
  • Carries judge_content_hash for stale-rubric self-invalidation.

Outcomes is a derived projection of the locked SCHEMA.md — never an alternate authority.

Tests (TDD)

  • TestCompileOutcomesRubric_StripsHoldoutTarget — criteria carried, zero holdout leak.
  • TestCompileOutcomesRubric_RefusesLeak — deny-by-default errors on a leaking criterion, names the value.
  • 9184 cmd/ao tests pass, go vet/build clean, COMMANDS.md regenerated for the new subcommand.

Next in ag-hdqu0

.2 ingest→verdict; .4 re-parse/stale-hash; .9 gate#3 burn ledger (remaining half); the future GT-loader slice wires holdout_values from the substrate automatically.

Closes-scenario: ag-hdqu0.1#compile-strip
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_test.go

…g-hdqu0 #compile-strip)

Adds 'ao eval outcomes compile <input.json>' under evalCmd: projects a locked
Task + criteria into an Outcomes rubric via the merged evalsubstrate.ProjectRubric,
then re-scans (ContainsAny, guard layer 3) and REFUSES to emit any rubric that
would carry a holdout answer across the cloud boundary (Managed Agents are not
ZDR). Outcomes is a derived projection of SCHEMA.md, never an alternate authority.

TDD: TestCompileOutcomesRubric_StripsHoldoutTarget (criteria carried, no leak),
TestCompileOutcomesRubric_RefusesLeak (deny-by-default on a leaking criterion).
go test/vet/build green; COMMANDS.md regenerated for the new subcommand.

Closes-scenario: ag-hdqu0.1#compile-strip
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_test.go
@github-actions github-actions Bot added the cli label May 29, 2026
boshu2 added 3 commits May 29, 2026 15:12
…canary (ag-hdqu0.1)

CI contracts-sync contract-canary agentops-core.cli-command-surface-matrix failed:
the new command was an uncovered leaf in check-cmdao-surface-parity. Added a
public-stateful-fixture-needed allowlist entry (core logic unit-tested in
eval_outcomes_test.go; CLI smoke needs an input.json fixture — follow-up ag-lkxx).
Parity check now reports the command 'allowlisted'.
…mes command (ag-hdqu0.1)

The documented-cli-help-matrix canary case hard-codes command heading counts;
'ao eval outcomes' (#### sub) bumps sub 175->176 and all 245->246, and the matrix
size 245->246. 'ao eval outcomes compile' is ##### (not counted in top/sub/all).
Fixture now passes (cli-help-matrix-ok). Pairs with the surface-parity allowlist
entry to clear the full cli-command-surface-matrix canary. No regen path for these
counts — tracked in ag-lkxx.
…ag-hdqu0.1)

The documented-cli-help-matrix case asserts stdout_contains 'cli-command-headings:
top=70 sub=175 all=245'; the fixture now prints 176/246 after adding the outcomes
command. Updated the expected string to match (third + final counts location after
the fixture's assert + printf). Canary aggregate had already risen 0.7917->0.9306;
this clears the last failing case.
@boshu2 boshu2 merged commit a3900ad into main May 29, 2026
14 checks passed
@boshu2 boshu2 deleted the feat/ag-hdqu0.1-outcomes-compile branch May 29, 2026 19:49
boshu2 added a commit that referenced this pull request May 29, 2026
…cord (ag-hdqu0 #ingest-verdict) (#603)

## What
Adds `ao eval outcomes ingest <score.json>` — the third slice of
ag-hdqu0. Maps an Outcomes grader score (aggregate + per-criterion) onto
the **one** council verdict record
(`skills/council/schemas/verdict.json`: PASS/WARN/FAIL +
`satisfaction_score` + `satisfaction_breakdown`), closing the **Outcomes
→ Knowledge Flywheel** loop without forking the verdict format. Pairs
with `ao eval outcomes compile` (#601).

Verdict bands: PASS ≥ threshold · FAIL < 70% of threshold · WARN
between.

## One-shot surface landing (compounding from #601/ag-lkxx)
`ingest` is a `#####` leaf, so heading counts are unchanged — only an
allowlist entry was needed. Ran
`scripts/test-agentops-contract-canaries.sh` locally **before pushing**:
`cli-command-surface-matrix verdict=pass aggregate=1`. No CI canary
round-trips.

## Tests (TDD)
- `TestIngestOutcomesScore_ProducesVerdictRecord` — PASS verdict,
satisfaction_score=aggregate, breakdown=criterion scores, schema_version
4, non-nil findings.
- `TestIngestOutcomesScore_VerdictBands` — PASS/WARN/FAIL banding.
- `go test ./cmd/ao` green, vet/build clean.

Closes-scenario: ag-hdqu0.2#ingest-verdict
Bounded-context: BC1-Corpus
Evidence: cli/cmd/ao/eval_outcomes_ingest_test.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant