refactor(output): pin schema_version + int/float wire invariants (closes #152, #153) #174
Two stage-2 follow-ups to PR #149's typed aggregate models. Both pin wire-format invariants the future writer migration will depend on.

**`schema_version` survives `exclude_unset=True` (#152).** The current writer path (`json.dump(dict)`) relies on the orchestrator manually stamping `aggregate["schema_version"] = 1` before write. Once the writer migrates to `model.model_dump_json(exclude_unset=True)`, a `RunAggregate` whose `schema_version` was left at the model default would silently drop the envelope field, and every downstream dashboard that dispatches on `schema_version` would break.

Fix: `@model_serializer(mode="wrap")` on `RunAggregate` unconditionally injects `schema_version` into the dumped dict, regardless of dump options. The field's default stays `1`; callers can still pass it explicitly. This is an idiomatic Pydantic v2 pattern.

**Numeric fields preserve source type on JSON (#153).** The producer in `metrics.py` emits `sum([]) == 0` (`int`) for empty-aggregate `total_*` / `avg_*` fields and `float` otherwise. A pre-fix model with `float` fields coerces `0` → `0.0` on validation, and `model_dump_json` emits `0.0` where the source dict emits `0`. That is a wire-format break invisible to Python-dict comparison (`0 == 0.0`) but real when downstream tooling diffs the JSON byte-for-byte.

Fix: change numeric fields on `PerTaskMetrics` and `AggregateMetrics` from `float` to `int | float` unions. Pydantic v2's smart-union picks the more specific type on validation, so a source `int` round-trips as `int` and a source `float` as `float`. Defaults change from `0.0` to `0` to match the empty-aggregate source. Fields staying `float` are those that come from a division (`success_rate`, `pass@k`, `avg_score`), which are always float-valued from the producer.

**Writer docstring.** The docstring on `RunAggregateWriter.write_aggregate` now documents both invariants explicitly so future consumers can rely on them without needing to read the model source.
Two new canonical tests in `tests/canonical/test_run_aggregate_models_snapshot.py`:

- `test_schema_version_survives_exclude_unset`: constructs `RunAggregate` from a payload that omits `schema_version`, dumps with `exclude_unset=True`, and asserts the field is present in the output. Also verifies the JSON round-trip preserves the value.
- `test_int_float_json_string_no_drift`: for each of the four writer artifacts, produces a real dict from the metric-calc functions on an empty-aggregate trajectory (the harshest case for int/float drift), feeds it through the model, and compares byte-canonical JSON strings between the dict path and the model path. Divergences fail loudly with the exact source/model JSON diff.

Verification:

- `make lint` clean.
- `uv run pytest tests/canonical/test_run_aggregate_models_snapshot.py -v`: all 12 tests pass (10 existing + 2 new).
- `uv run pytest tests/ -m "unit or canonical" -q`: 2447 pass.
- Grep confirms zero downstream consumers reference the model classes directly today; the `int | float` widening is safe.
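For readers unfamiliar with the pattern, the `schema_version` fix described above might look roughly like the sketch below. It assumes Pydantic v2. `RunAggregateSketch` and its `run_id` field are hypothetical stand-ins rather than the project's real model, and the real serializer may differ in detail.

```python
from typing import Any

from pydantic import BaseModel, SerializerFunctionWrapHandler, model_serializer


class RunAggregateSketch(BaseModel):
    """Hypothetical stand-in for RunAggregate (assumption, not the real model)."""

    schema_version: int = 1
    run_id: str  # illustrative field only

    @model_serializer(mode="wrap")
    def _always_include_schema_version(
        self, handler: SerializerFunctionWrapHandler
    ) -> dict[str, Any]:
        # The handler applies the caller's dump options (e.g. exclude_unset).
        data = handler(self)
        # Restore the envelope field only if the handler dropped it;
        # a caller-set value is already present and is left untouched.
        data.setdefault("schema_version", self.schema_version)
        return data


# schema_version is left at its default, so exclude_unset would normally drop it.
aggregate = RunAggregateSketch(run_id="demo")
dumped = aggregate.model_dump(exclude_unset=True)
dumped_json = aggregate.model_dump_json(exclude_unset=True)
restored = RunAggregateSketch.model_validate_json(dumped_json)

# An explicitly set value passes through unchanged.
explicit = RunAggregateSketch(run_id="demo", schema_version=2)
explicit_dump = explicit.model_dump(exclude_unset=True)
```

Using `setdefault` rather than plain assignment keeps the serializer from overwriting whatever the wrapped handler already emitted.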
Claude finished @CiroGamboa's task

PR Review: AGENTS.md compliance
Verdict: ✅ Clean. No AGENTS.md violations found. This is a well-scoped, well-documented stage-1 infra PR.

What I verified

1. Errors surfaced explicitly (Core Rule 1 / Code Standard 1): ✅
2. Secrets single-abstraction (HARD RULE): ✅ N/A
3. Testing hygiene: ✅
4. Architecture / type-system choice: ✅
5. Repo hygiene / root cleanliness / scripts / temp artifacts: ✅
6. DRY / self-describing / nesting: ✅
Correctness confirmation (I checked the producer, not just the PR text)

The
Non-blocking observations (optional, not violations)
Observation (1) is the only thing I'd suggest reconciling before merge, and it's cosmetic; the tests pass either way.
…e int inputs
Local review pointed out three real issues on the previous commit:
1. **Widening was inconsistent.** ``AggregateMetrics.success_rate_micro/
macro``, ``avg_score_micro/macro``, and ``pass_at_*_macro`` were
widened to ``int | float | None`` in the previous commit, but every
one of them is ``sum(...) / n`` in metrics.py — always float at the
producer, matching the ``PerTaskMetrics.success_rate``/``avg_score``/
``pass@k`` siblings that stayed narrow ``float``. Narrowed the
AggregateMetrics rate fields back to ``float | None`` for symmetry.
Fields that stay widened (``total_*_tokens``, ``avg_*_tokens``,
latency percentiles, ``stuck_rate``, cost sums) are the ones with
naturally-integer semantics.
2. **The `int/float` test didn't exercise int inputs.** The previous
``test_int_float_json_string_no_drift`` fed
``calculate_task_metrics([single_trajectory])`` and compared JSON
strings. But every widened field in that path resolves to ``float``
via division — the pre-fix ``float``-only model produces
byte-identical output. The test guarded nothing. Replaced with three
sharper tests:
- ``test_int_valued_numeric_fields_preserve_int_type`` — hand-injects
``int`` values into every widened field of ``PerTaskMetrics`` and
asserts the model preserves ``int`` type at both the Python level
(``isinstance(value, int)``) and the JSON wire level (dumped values
are ``int``, not ``float``). Would fail if any widened field is
narrowed back to ``float``.
- ``test_int_valued_aggregate_fields_preserve_int_type`` — same
invariant for ``AggregateMetrics``. Also asserts that
``success_rate_micro`` STAYS narrow ``float`` — pinning the
symmetry with ``PerTaskMetrics``.
- ``test_current_producer_output_matches_model_dump_byte_for_byte``
— round-trip guard for the current live producer output across
weighted and unweighted aggregates. Catches accidental coercion
in the narrow rate fields.
3. **Unused ``info`` parameter** on ``_always_include_schema_version``.
Dropped along with the ``SerializationInfo`` import. Clarified the
docstring: the serializer injects only when the wrapped handler
omits the field (i.e. under ``exclude_unset=True`` with the
default), never overwrites a caller-set value.
Also updated the module docstring to spell out which fields ARE widened
(``int``-natural: tokens, latency, stuck) vs. which stay narrow
(rate-shaped divisions) and acknowledge that this is defensive
future-proofing — today's producer path always emits float for every
widened field because the metric-calc functions short-circuit on empty
input.
Verification: 14 tests pass on the snapshot file (up from 12); full
unit + canonical suite: 2449 pass.
Claude finished @CiroGamboa's task

PR Review: AGENTS.md compliance
Verdict: ✅ Clean. No AGENTS.md violations found. Well-scoped stage-1 infra PR. The prior review's one open observation (the

Rule-by-rule
Correctness confirmation (verified against the producer, not just PR text)
Non-blocking nits (optional, not violations)
Neither nit blocks merge; tests pass either way.
Follow-up on the review nit — the two `import json` statements inside `test_int_valued_numeric_fields_preserve_int_type` and `test_current_producer_output_matches_model_dump_byte_for_byte` are now a single top-of-module `import json` (the rest of the file already imports at module level). Purely cosmetic; no behaviour change.
Claude finished @CiroGamboa's task

PR Review: AGENTS.md compliance
Verdict: ✅ Clean. No AGENTS.md violations found. Well-scoped stage-1 wire-invariant PR; the prior two review rounds' open observations are all reconciled in the current diff.

Rule-by-rule
Correctness confirmation (verified against the producer, not just PR text)
Non-blocking nits (optional, not violations)
Neither nit blocks merge; tests pass either way.
TL;DR

Closes #152 and #153. Both are wire-format invariants the future writer migration (models replacing dicts in `RunAggregateWriter`) will depend on. Landing them now, before the migration, means the migration PR only has to verify against tests that already exist.

What ships
1. `schema_version` always survives dumps (#152)

`RunAggregate.schema_version` gains a `@model_serializer(mode="wrap")` that injects the field into the dumped dict whenever the wrapped handler omitted it, including the `exclude_unset=True` case where the field was left at its Pydantic default.

Why: the current writer path (`json.dump(dict)`) relies on the orchestrator manually stamping `aggregate["schema_version"] = 1` before write. Once the writer migrates to `model.model_dump_json(exclude_unset=True)`, a `RunAggregate` whose `schema_version` was left at the model default would silently drop the envelope field, and every downstream dashboard that dispatches on `schema_version` would break. The serializer closes that gap without requiring callers to remember the invariant. The default stays `1` and can still be passed explicitly.

2. Type-preserving numeric fields (#153)
Numeric fields with naturally-integer semantics on `PerTaskMetrics` and `AggregateMetrics` change from `float` to `int | float`. Rate-shaped fields (`success_rate`, `pass@k`, `avg_score`, and their `_micro`/`_macro` aggregates) stay narrow `float`.

Fields widened (natural integer inputs):

- `avg_*_tokens`, `total_*_tokens` on both models
- `latency_p50/90/99_s`, `api_call_latency_p50/90/99_s`, `latency_p50/90/99_s_macro`
- `total_cost_usd`, `avg_cost_usd`, `judge_cost_usd`, `total_cost_incl_judge_usd`
- `avg_turns`, `avg_tool_calls`, `avg_latency_s`
- `stuck_rate` (empirical count, not a rate)

Fields kept narrow (always division results at the producer):

- `PerTaskMetrics.success_rate`, `avg_score`, `pass@k`, `pass_hat@k`
- `AggregateMetrics.success_rate_micro/macro`, `avg_score_micro/macro`, `pass_at_*_macro`, `pass_hat_at_*_macro`

Rationale: Pydantic v2's smart-union picks the more specific type on validation, so a source `int 42` round-trips to `42` (not `42.0`) in the JSON dump. A pre-fix `float`-only model coerces `int 42` → `42.0` on validation and emits `42.0` on the wire. That is invisible to Python-dict comparison (`42 == 42.0`) but real when downstream tooling diffs the JSON byte-for-byte.

Honesty note: today's producers in `metrics.py` happen to always emit `float` for every widened field (both metric-calc functions short-circuit on empty input rather than dividing by zero). The union is defensive future-proofing: if a future refactor introduces an `int` return path (e.g. a counting aggregator that returns `0` for empty), the model preserves it without a coordinated model change.

3. Writer docstring explicitness
`RunAggregateWriter.write_aggregate` documents both invariants in its docstring so future consumers can rely on them without needing to read the model source.

Tests

Four canonical tests pin the invariants (14 total on the file after this PR, up from 10):

- `test_schema_version_survives_exclude_unset`: constructs `RunAggregate` from a payload that omits `schema_version`, dumps with `exclude_unset=True`, and asserts the field is present in the output. Also verifies the JSON round-trip preserves the value.
- `test_int_valued_numeric_fields_preserve_int_type`: hand-injects `int` values into every widened field of `PerTaskMetrics` and asserts the model preserves `int` at both the Python level (`isinstance(value, int)`) and the JSON wire level. Would fail if any widened field is narrowed back to `float`.
- `test_int_valued_aggregate_fields_preserve_int_type`: same invariant for `AggregateMetrics`. Also asserts that `success_rate_micro` STAYS narrow `float`, pinning the symmetry with `PerTaskMetrics.success_rate`.
- `test_current_producer_output_matches_model_dump_byte_for_byte`: round-trip guard for the current live producer output across weighted and unweighted aggregates. Catches accidental coercion in the narrow rate fields.

Tests exercise both branches of the `int | float` union: the producer's actual `float` output round-trips as `float`, and hand-injected `int` inputs round-trip as `int`.

Test plan

- `make lint` clean.
- `uv run pytest tests/canonical/test_run_aggregate_models_snapshot.py -v`: 14 pass.
- `uv run pytest tests/ -m "unit or canonical" -q`: 2449 pass.
- Grep for `PerTaskMetrics|AggregateMetrics|RunAggregate` usage across `tolokaforge/` and `tests/`: zero live consumers today (the models exist alongside the current dict-typed writer), so the widening is safe to land ahead of the migration.

Review-driven follow-ups (this PR)
Local review of the initial commit flagged three real issues, all addressed in the second commit:

1. `AggregateMetrics.success_rate_micro/macro`, `avg_score_micro/macro`, `pass_at_*_macro` were widened but their `PerTaskMetrics` siblings stayed `float`. Narrowed back for symmetry.
2. `int/float` test didn't exercise `int` inputs: the producer path only ever emits float today. Rewrote with hand-injected `int` values so the test would actually fail if any widened field regresses to `float`.
3. Unused `info` parameter on the serializer: dropped along with the `SerializationInfo` import.

What this PR does NOT do

- … `dict[str, Any]`, and the `FileAggregateWriter` still serialises via `json.dump(payload)`. This PR just pins the invariants the migration will inherit.
- `schema_version` bump. The wire format is unchanged: the `int | float` unions preserve the current dict path's output byte-for-byte for any producer that emits `int`.
- `metrics.py` and `failure_attribution.py` still return `dict[str, Any]`.

Closes
Closes #152 (schema_version survives exclude_unset).
Closes #153 (int/float dump does not diverge).