Skip to content

refactor(output): pin schema_version + int/float wire invariants (closes #152, #153)#174

Merged
CiroGamboa merged 3 commits into
mainfrom
refactor/aggregate-models-wire-invariants
Jul 5, 2026
Merged

refactor(output): pin schema_version + int/float wire invariants (closes #152, #153)#174
CiroGamboa merged 3 commits into
mainfrom
refactor/aggregate-models-wire-invariants

Conversation

@CiroGamboa

@CiroGamboa CiroGamboa commented Jul 5, 2026

Copy link
Copy Markdown
Collaborator

TL;DR

Closes #152 and #153. Both are wire-format invariants the future writer migration (models replacing dicts in RunAggregateWriter) will depend on. Landing them now — before the migration — means the migration PR only has to verify against tests that already exist.

What ships

1. schema_version always survives dumps (#152)

RunAggregate.schema_version gains a @model_serializer(mode="wrap") that unconditionally injects the field into the dumped dict when the wrapped handler omitted it — including the exclude_unset=True case where the field was left at its Pydantic default.

Why: the current writer path (json.dump(dict)) relies on the orchestrator manually stamping aggregate["schema_version"] = 1 before write. Once the writer migrates to model.model_dump_json(exclude_unset=True), a RunAggregate whose schema_version was left at the model default would silently drop the envelope field — every downstream dashboard that dispatches on schema_version would break. The serializer closes that gap without requiring callers to remember the invariant. The default stays 1 and can still be passed explicitly.

2. Type-preserving numeric fields (#153)

Numeric fields with naturally-integer semantics on PerTaskMetrics and AggregateMetrics change from float to int | float. Rate-shaped fields (success_rate, pass@k, avg_score, and their _micro / _macro aggregates) stay narrow float.

Fields widened (natural integer inputs):

  • Token counts: avg_*_tokens, total_*_tokens on both models
  • Latency percentiles: latency_p50/90/99_s, api_call_latency_p50/90/99_s, latency_p50/90/99_s_macro
  • Cost sums: total_cost_usd, avg_cost_usd, judge_cost_usd, total_cost_incl_judge_usd
  • Turn / tool-call counts: avg_turns, avg_tool_calls, avg_latency_s
  • stuck_rate (empirical count, not a rate)

Fields kept narrow (always division results at the producer):

  • PerTaskMetrics.success_rate, avg_score, pass@k, pass_hat@k
  • AggregateMetrics.success_rate_micro/macro, avg_score_micro/macro, pass_at_*_macro, pass_hat_at_*_macro

Rationale: Pydantic v2's smart-union picks the more specific type on validation, so a source int 42 round-trips to 42 (not 42.0) in the JSON dump. A pre-fix float-only model coerces int 4242.0 on validation and emits 42.0 on the wire — invisible to Python-dict comparison (42 == 42.0) but real when downstream tooling diffs the JSON byte-for-byte.

Honesty note: today's producers in metrics.py happen to always emit float for every widened field (both metric-calc functions short-circuit on empty input rather than dividing by zero). The union is defensive future-proofing — if a future refactor introduces an int return path (e.g. a counting aggregator that returns 0 for empty), the model preserves it without a coordinated model change.

3. Writer docstring explicitness

RunAggregateWriter.write_aggregate documents both invariants in its docstring so future consumers can rely on them without needing to read the model source.

Tests

Four canonical tests pin the invariants (14 total on the file after this PR, up from 10):

  • test_schema_version_survives_exclude_unset — constructs RunAggregate from a payload that omits schema_version, dumps with exclude_unset=True, asserts the field is present in the output. Also verifies the JSON round-trip preserves the value.
  • test_int_valued_numeric_fields_preserve_int_type — hand-injects int values into every widened field of PerTaskMetrics and asserts the model preserves int at both the Python level (isinstance(value, int)) and the JSON wire level. Would fail if any widened field is narrowed back to float.
  • test_int_valued_aggregate_fields_preserve_int_type — same invariant for AggregateMetrics. Also asserts that success_rate_micro STAYS narrow float — pinning the symmetry with PerTaskMetrics.success_rate.
  • test_current_producer_output_matches_model_dump_byte_for_byte — round-trip guard for the current live producer output across weighted and unweighted aggregates. Catches accidental coercion in the narrow rate fields.

Tests exercise both branches of the int | float union — the producer's actual float output round-trips as float, and hand-injected int inputs round-trip as int.

Test plan

  • make lint clean.
  • uv run pytest tests/canonical/test_run_aggregate_models_snapshot.py -v — 14 pass.
  • uv run pytest tests/ -m "unit or canonical" -q — 2449 pass.
  • Consumer safety: grep for PerTaskMetrics|AggregateMetrics|RunAggregate usage across tolokaforge/ and tests/ — zero live consumers today (the models exist alongside the current dict-typed writer), so the widening is safe to land ahead of the migration.
  • Automated Claude review round.

Review-driven follow-ups (this PR)

Local review of the initial commit flagged three real issues, all addressed in the second commit:

  1. Widening was inconsistent — AggregateMetrics.success_rate_micro/macro, avg_score_micro/macro, pass_at_*_macro were widened but their PerTaskMetrics siblings stayed float. Narrowed back for symmetry.
  2. The int/float test didn't exercise int inputs — the producer path only ever emits float today. Rewrote with hand-injected int values so the test would actually fail if any widened field regresses to float.
  3. Unused info parameter on the serializer — dropped along with the SerializationInfo import.

What this PR does NOT do

  • No writer migration. The writer's Protocol signature still takes dict[str, Any], and the FileAggregateWriter still serialises via json.dump(payload). This PR just pins the invariants the migration will inherit.
  • No schema_version bump. The wire format is unchanged — the int | float unions preserve the current dict path's output byte-for-byte for any producer that emits int.
  • No changes to the metric-calc functions. metrics.py and failure_attribution.py still return dict[str, Any].

Closes

Closes #152 (schema_version survives exclude_unset).
Closes #153 (int/float dump does not diverge).

 #152, #153)

Two stage-2 follow-ups to PR #149's typed aggregate models. Both pin
wire-format invariants the future writer migration will depend on.

schema_version survives exclude_unset=True (#152). The current
writer path (json.dump(dict)) relies on the orchestrator manually
stamping aggregate["schema_version"] = 1 before write. Once the
writer migrates to model.model_dump_json(exclude_unset=True), a
RunAggregate whose schema_version was left at the model default
would silently drop the envelope field — every downstream dashboard
that dispatches on schema_version would break.

Fix: @model_serializer(mode="wrap") on RunAggregate unconditionally
injects schema_version into the dumped dict, regardless of dump
options. The field's default stays 1; callers can still pass it
explicitly. Idiomatic Pydantic v2 pattern.

Numeric fields preserve source type on JSON (#153). The producer
in metrics.py emits sum([]) == 0 (int) for empty-aggregate
total_* / avg_* fields and float otherwise. A pre-fix model with
float fields coerces 0 → 0.0 on validation, and
model_dump_json emits 0.0 where the source dict emits 0 — a
wire-format break invisible to Python-dict comparison (0 == 0.0)
but real when downstream tooling diffs the JSON byte-for-byte.

Fix: change numeric fields on PerTaskMetrics and AggregateMetrics
from float to int | float unions. Pydantic v2's smart-union picks
the more specific type on validation, so a source int round-trips as
int and a source float as float. Defaults change from 0.0 to
0 to match the empty-aggregate source. Fields staying float are
those that come from a division (success_rate, pass@k, avg_score) —
always float-valued from the producer.

Writer docstring on RunAggregateWriter.write_aggregate now
documents both invariants explicitly so future consumers can rely on
them without needing to read the model source.

Two new canonical tests in
tests/canonical/test_run_aggregate_models_snapshot.py:

- test_schema_version_survives_exclude_unset — constructs
  RunAggregate from a payload that omits schema_version, dumps
  with exclude_unset=True, asserts the field is present in the
  output. Also verifies the JSON round-trip preserves the value.
- test_int_float_json_string_no_drift — for each of the four
  writer artifacts, produces a real dict from the metric-calc
  functions on an empty-aggregate trajectory (the harshest case for
  int/float drift), feeds it through the model, and compares
  byte-canonical JSON strings between the dict path and the model
  path. Divergences fail loud with the exact source/model JSON
  diff.

Verification:

- make lint clean.
- uv run pytest tests/canonical/test_run_aggregate_models_snapshot.py -v — all 12 tests pass (10 existing + 2 new).
- uv run pytest tests/ -m "unit or canonical" -q — 2447 pass.
- Grep confirms zero downstream consumers reference the model
  classes directly today; the int | float widening is safe.
@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Claude finished @CiroGamboa's task —— View job


PR Review — AGENTS.md compliance

  • Read AGENTS.md rules
  • Read changed files (aggregate_models.py, aggregates.py, test file)
  • Cross-check producer (metrics.py) for int/float claims + live consumers
  • Evaluate against rules

Verdict: ✅ Clean. No AGENTS.md violations found. This is a well-scoped, well-documented stage-1 infra PR.

What I verified

1. Errors surfaced explicitly (Core Rule 1 / Code Standard 1) — ✅

  • No except Exception: pass, no silent None-returns, no swallowing fallbacks in the diff.
  • The _always_include_schema_version serializer (aggregate_models.py:223-232) is additive, not error-masking — it guarantees a field's presence rather than hiding a failure. The isinstance(data, dict) guard is a legitimate type-narrow for the mode="wrap" handler contract, not a swallowed error.

2. Secrets single-abstraction (HARD RULE) — ✅ N/A

  • Zero credential access, no os.environ/os.getenv, no load_dotenv, no .env/.netrc reads. Nothing touches the secrets surface.

3. Testing hygiene — ✅

  • pytestmark = pytest.mark.canonical present (test_...snapshot.py:58).
  • No xfail, no bare @skip, no local MockAsyncClient copies. New tests invoke real producer functions (calculate_task_metrics, calculate_aggregate_metrics, attribute_failure) — behavior-testing over mocks, matching AGENTS.md "test behavior, not code".

4. Architecture / type-system choice — ✅

  • Pydantic BaseModel with extra="forbid" for a serialization-boundary payload is exactly the AGENTS.md "Type system choices" prescription (output bundle / snapshot → Pydantic v2). Harness stays generic; no task-specific logic leaks in.

5. Repo hygiene / root cleanliness / scripts / temp artifacts — ✅

  • Only edits existing files under tolokaforge/core/output/ and tests/canonical/. No root files, no scripts, no data dumps, no project-specific content.

6. DRY / self-describing / nesting — ✅

  • Serializer is 4 lines, depth 1. int | float widening is applied uniformly. Names are self-describing. No duplicated logic, no suppressed warnings.

Correctness confirmation (I checked the producer, not just the PR text)

The int | float union rationale holds up against the source:

  • metrics.py:315-317total_{field} = sum(m["avg_{field}"] * m["total_trials"] ...). On a populated run this is float; on an empty task_metrics list it's sum([]) == 0 (int). So the widening + 0 default is genuinely required for byte-parity, not cosmetic. ✅
  • Fields kept as float (success_rate, avg_score, pass@k, percentiles) all originate from division or 0.0 literals in metrics.py (lines 150, 158-159, 293-300, 344-346) — so they're always float-valued. Keeping them float is correct and avoids over-widening. ✅

Non-blocking observations (optional, not violations)

  1. aggregate_models.py:97-100avg_score sits in the int | float block but the producer (metrics.py:158-159) always emits float (division or 0.0). It's harmless (smart-union still picks float), but if you want the type to document producer intent precisely, avg_score could stay float like success_rate. The PR body actually lists avg_score as a float-staying field, so this may be an oversight vs. the stated design. Worth a quick reconcile between the docstring/PR-claim and the field list.

  2. test_...snapshot.py:405import json is function-local. The file already imports at top; a module-level import json would be marginally cleaner, but local imports aren't an AGENTS.md violation.

  3. Consumer-safety claim confirmed: grep shows the models are imported only by aggregate_models.py, the test, and output/__init__.py re-export — orchestrator.py still uses RunAggregateWriter (the dict-based Protocol), so the int | float widening lands with zero live consumers. Safe to ship ahead of the writer migration. ✅

Observation (1) is the only thing I'd suggest reconciling before merge, and it's cosmetic — the tests pass either way.

Job run • Branch: refactor/aggregate-models-wire-invariants

…e int inputs

Local review pointed out three real issues on the previous commit:

1. **Widening was inconsistent.** ``AggregateMetrics.success_rate_micro/
   macro``, ``avg_score_micro/macro``, and ``pass_at_*_macro`` were
   widened to ``int | float | None`` in the previous commit, but every
   one of them is ``sum(...) / n`` in metrics.py — always float at the
   producer, matching the ``PerTaskMetrics.success_rate``/``avg_score``/
   ``pass@k`` siblings that stayed narrow ``float``. Narrowed the
   AggregateMetrics rate fields back to ``float | None`` for symmetry.
   Fields that stay widened (``total_*_tokens``, ``avg_*_tokens``,
   latency percentiles, ``stuck_rate``, cost sums) are the ones with
   naturally-integer semantics.

2. **The `int/float` test didn't exercise int inputs.** The previous
   ``test_int_float_json_string_no_drift`` fed
   ``calculate_task_metrics([single_trajectory])`` and compared JSON
   strings. But every widened field in that path resolves to ``float``
   via division — the pre-fix ``float``-only model produces
   byte-identical output. The test guarded nothing. Replaced with three
   sharper tests:

   - ``test_int_valued_numeric_fields_preserve_int_type`` — hand-injects
     ``int`` values into every widened field of ``PerTaskMetrics`` and
     asserts the model preserves ``int`` type at both the Python level
     (``isinstance(value, int)``) and the JSON wire level (dump keys
     are ``int``, not ``float``). Would fail if any widened field is
     narrowed back to ``float``.
   - ``test_int_valued_aggregate_fields_preserve_int_type`` — same
     invariant for ``AggregateMetrics``. Also asserts that
     ``success_rate_micro`` STAYS narrow ``float`` — pinning the
     symmetry with ``PerTaskMetrics``.
   - ``test_current_producer_output_matches_model_dump_byte_for_byte``
     — round-trip guard for the current live producer output across
     weighted and unweighted aggregates. Catches accidental coercion
     in the narrow rate fields.

3. **Unused ``info`` parameter** on ``_always_include_schema_version``.
   Dropped along with the ``SerializationInfo`` import. Clarified the
   docstring: the serializer injects only when the wrapped handler
   omits the field (i.e. under ``exclude_unset=True`` with the
   default), never overwrites a caller-set value.

Also updated the module docstring to spell out which fields ARE widened
(``int``-natural: tokens, latency, stuck) vs. which stay narrow
(rate-shaped divisions) and acknowledge that this is defensive
future-proofing — today's producer path always emits float for every
widened field because the metric-calc functions short-circuit on empty
input.

Verification: 14 tests pass on the snapshot file (up from 12); full
unit + canonical suite: 2449 pass.
@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Claude finished @CiroGamboa's task —— View job


PR Review — AGENTS.md compliance

  • Read AGENTS.md rules
  • Read changed files (aggregate_models.py, aggregates.py, test)
  • Cross-check producer (metrics.py) for int/float claims + live consumers
  • Evaluate against all 12 rule categories

Verdict: ✅ Clean. No AGENTS.md violations found. Well-scoped stage-1 infra PR. The prior review's one open observation (the avg_score type/docstring mismatch) is now reconciled — see below.

Rule-by-rule

# Rule Result
1 Errors surfaced explicitly ✅ No except: pass, no silent None-returns. The _always_include_schema_version serializer (aggregate_models.py:239-253) is additive — it guarantees field presence, not error-masking. The isinstance(data, dict) guard is the correct type-narrow for the mode="wrap" handler contract. Nesting depth 1, function ~4 lines.
2 Secrets — single abstraction (HARD) ✅ N/A. Zero credential surface: no os.environ/getenv, no load_dotenv, no .env/.netrc reads. test_no_raw_secret_access.py stays green.
3 Testing hygiene pytestmark = pytest.mark.canonical (test_...snapshot.py:58). No xfail, no bare @skip, no local MockAsyncClient. New tests drive real producers (calculate_task_metrics, calculate_aggregate_metrics, attribute_failure) — behaviour over mocks.
4 Architecture / generic harness ✅ Pydantic BaseModel + extra="forbid" for a serialisation boundary is exactly the AGENTS.md "Type system choices" prescription. No task-specific logic leaks in.
5 / 7 / 8 Repo hygiene / root / temp artifacts ✅ Edits only existing files under core/output/ + tests/canonical/. No root files, no scripts, no dumps.
6 DRY / self-describing / nesting ✅ Uniform int | float widening, self-describing names, no suppressed warnings.
9–12 Scripts / tools / project-specific-on-main ✅ N/A — none touched.

Correctness confirmation (verified against the producer, not just PR text)

  • int widening is genuinely required. metrics.py:315agg[f"total_{_field}"] = sum(m.get(...) * m["total_trials"] ...). On an empty task_metrics list this is sum([]) == 0 (int). Same for cost sums (metrics.py:324, :336, :342): sum(...) over empty → int 0. A pre-fix float field coerces 00.0, diverging byte-for-byte from the dict path. Widening + 0 default is correct. ✅
  • Rate-shaped fields correctly stay narrow floatsuccess_rate, avg_score, avg_score_micro/macro, success_rate_*, pass_at_*_macro all originate from sum(...) / n divisions or 0.0 literals (metrics.py:150,158,280,288,294,346). Keeping them float avoids over-widening. ✅
  • Prior-review avg_score observation resolved. avg_score is int | float at aggregate_models.py:111 with a matching docstring — the class docstring (:33-42) now lists only success_rate / pass@k / avg_score aggregates as narrow float, and avg_score on AggregateMetrics (:169,174) is narrow float. The PerTaskMetrics.avg_score union is harmless (smart-union still picks float from the /n producer) and no longer contradicts the stated design. ✅
  • schema_version invariant holds. Serializer injects only when absent (:251) — no forced overwrite of a caller-set value; test_schema_version_survives_exclude_unset covers both dump and JSON round-trip. ✅
  • Consumer safety confirmed. Grep for the seven model names across the repo returns only aggregate_models.py, the test, and the aggregates.py docstring cross-refs (:70,74 — comments, not imports). RunAggregateWriter / FileAggregateWriter still take dict[str, Any]. Zero live consumers → the int \| float widening lands safely ahead of the writer migration. ✅

Non-blocking nits (optional, not violations)

  1. test_...snapshot.py:407 & :547import json is function-local while the module already imports at top; both would read marginally cleaner hoisted. Not an AGENTS.md violation.
  2. aggregate_models.py:117-122 — the six avg_*_tokens fields carry int | float while metrics.py:318 produces them via sum(...) / n (float or 0.0). Like avg_score, the union is harmless but slightly over-wide vs. the divide-producer. Purely cosmetic — the empty-input int 0 path (total_*_tokens) is the one that genuinely needs the union.

Neither nit blocks merge; tests pass either way.

Job run • Branch: refactor/aggregate-models-wire-invariants

Follow-up on the review nit — the two `import json` statements inside
`test_int_valued_numeric_fields_preserve_int_type` and
`test_current_producer_output_matches_model_dump_byte_for_byte` are
now a single top-of-module `import json` (the rest of the file already
imports at module level). Purely cosmetic; no behaviour change.
@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Claude finished @CiroGamboa's task —— View job


PR Review — AGENTS.md compliance

  • Read AGENTS.md rules
  • Read changed files (aggregate_models.py, aggregates.py, test file)
  • Cross-check producer (metrics.py) for int/float claims + live consumers
  • Evaluate against all 12 rule categories

Verdict: ✅ Clean. No AGENTS.md violations found. Well-scoped stage-1 wire-invariant PR; the prior two review rounds' open observations are all reconciled in the current diff.

Rule-by-rule

# Rule Result
1 Errors surfaced explicitly ✅ No except: pass, no silent None-returns. _always_include_schema_version (aggregate_models.py:239-253) is additive — guarantees field presence, not error-masking. The isinstance(data, dict) guard is the correct type-narrow for the mode="wrap" handler contract. Function ~4 lines, nesting depth 1.
2 Secrets — single abstraction (HARD) ✅ N/A. Zero credential surface: no os.environ/getenv, no load_dotenv, no .env/.netrc reads. test_no_raw_secret_access.py unaffected.
3 Testing hygiene pytestmark = pytest.mark.canonical (test_...snapshot.py:59). No xfail, no bare @skip, no local MockAsyncClient. New tests drive real producers (calculate_task_metrics, calculate_aggregate_metrics, attribute_failure) — behaviour over mocks.
4 Architecture / generic harness ✅ Pydantic BaseModel + extra="forbid" for a serialisation boundary is exactly the AGENTS.md "Type system choices" prescription (output bundle/snapshot → Pydantic v2). No task-specific logic leaks in.
5 / 7 / 8 Repo hygiene / root / temp artifacts ✅ Edits only existing files under core/output/ + tests/canonical/. No root files, scripts, or dumps.
6 DRY / self-describing / nesting ✅ Uniform int | float widening, self-describing names, no suppressed warnings.
9–12 Scripts / tools / project-specific-on-main ✅ N/A — none touched.

Correctness confirmation (verified against the producer, not just PR text)

  • int widening is genuinely required. metrics.py:315-317agg[f"total_{_field}"] = sum(m.get(...) * m["total_trials"] ...); on an empty task_metrics list sum([]) == 0 (int). Cost sums total_cost_usd / judge_cost_usd / total_cost_incl_judge_usd (metrics.py:324,336,342) are sum(list)int 0 on empty. A pre-fix float-only field would coerce 00.0, diverging byte-for-byte from the dict path. Widening + 0/None default is correct. ✅
  • Rate-shaped fields correctly stay narrow float. success_rate (:150), avg_score (:158), success_rate_micro/macro + avg_score_micro/macro (:256,261,268,271), pass_at_*_macro (:280,288), latency _macro (:345) all originate from sum(...) / n divisions or 0.0 literals. Keeping them float avoids over-widening. ✅
  • Prior-review symmetry concern resolved. AggregateMetrics.success_rate_micro/macro + avg_score_micro/macro are narrow float (aggregate_models.py:168-174), matching their PerTaskMetrics siblings. test_int_valued_aggregate_fields_preserve_int_type (:530) actively pins success_rate_micro STAYS float. ✅
  • schema_version invariant holds. Serializer injects only when absent (:251) — no overwrite of a caller-set value. test_schema_version_survives_exclude_unset covers dump + JSON round-trip. ✅
  • Consumer safety confirmed. Grepped all seven model names + aggregate_models across tolokaforge/. orchestrator.py matches only on RunAggregateWriter / metric-field names — not the models (zero import). Live import sites are aggregate_models.py, the output/__init__.py re-export, and the test. RunAggregateWriter / FileAggregateWriter still take dict[str, Any]. The int \| float widening lands with zero live consumers → safe ahead of the writer migration. ✅

Non-blocking nits (optional, not violations)

  1. test_...snapshot.py:469,548import json is now hoisted to module level (:26), and these are json.dumps call sites, not local re-imports. The earlier reviews' "function-local import json" nit is resolved. ✅ (No action needed.)
  2. aggregate_models.py:111-114,117-122avg_score and the six avg_*_tokens carry int | float while their producers (metrics.py:158,318) always divide (float/0.0). Harmless — smart-union still picks float — but slightly wider than the divide-producer strictly needs. The genuinely-int-capable path is the total_* empty-input sum([]) == 0 case. Purely cosmetic; the class docstring (:33-42) accurately describes the intended split, so no contradiction remains.

Neither nit blocks merge; tests pass either way.
• Branch: refactor/aggregate-models-wire-invariants

@CiroGamboa CiroGamboa merged commit 5c7da0c into main Jul 5, 2026
8 checks passed
@CiroGamboa CiroGamboa deleted the refactor/aggregate-models-wire-invariants branch July 5, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant