Proposal: v3 metadata storage redesign #842
Conversation
Companion to SPECIAL_CONSTRAINTS_V3.md (#841). Captures the design discussion for moving metadata (name, subscripts, parameters, description, provenance) out of per-object structs into ID-keyed SoA stores owned by ConstraintCollection<T> and Instance, and reshaping the Python *_df API around JOIN-based composition (core df + metadata df + long-format parameters/provenance) instead of wide DataFrames. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reframe the proposal as a prerequisite for #841: the proto wire shape of ConstraintMetadata (inline vs. top-level map) is deferred to #841 and decided based on the parse / serialize boundary defined by Stage 2 here.
- Add a 3-stage migration diagram (Python API → Rust SoA → proto wire shape) and a "Why not Rust-first" subsection acknowledging the alternative order.
- Add ParametricInstance to the Rust SDK design (also owns decision_variables: BTreeMap<...> directly).
- Add a section on Python modeling-chain impact for standalone constraints (`(x[0] + x[1] == 1).add_name("c")` chains) with two options and a recommendation; flagged as a Stage 2 concern.
- Add a "Boundary changes" subsection covering From<v1::*> parse / serialize boundaries moving to the collection level, and an "Other types affected" subsection for LogicalMemoryProfile derive and pyo3-stub-gen.
- Switch Python examples from `merge(on="id")` to `join()` to reflect that entries_to_dataframe sets `id` as the DataFrame index.
- Move removed_reason to a separate long-format DataFrame in the Target API block (matching Solution's existing removed_reasons_df pattern).
- Add concrete recommendations to all 6 open questions and add a 7th for the Stage 2 modeling-chain choice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the Stage 1 / Stage 2 / Stage 3 split with a single connected v3-alpha redesign covering Rust runtime layout and Python API surface. Major content changes:
- Reframe the goal around "the same fact lives in 3 places (Rust struct, Python dict accessor, Python wide df)" and target a single source of truth on the Rust side.
- Introduce Series-based collection accessors: `instance.constraints`, `decision_variables`, and the special-constraint accessors all become `pandas.Series[ID -> Object]` instead of dict / list.
- Make `*_df` methods explicitly derived views: type-specific core columns extracted from the Series, joined with sidecar metadata / parameters / provenance / removed_reasons dfs.
- Remove all `Constraint.name` / `.subscripts` / `.parameters` / `.description` getters from Python wrappers — the only path to metadata is the metadata df.
- Drop the Stage 1 / 2 / 3 ordering discussion and the "Why not Rust-first" subsection; replace with a Breaking changes section.
- Keep the proto deferral to #841, the Rust SoA design, and the open questions intact; add an open question for Series-dtype semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restructure the proposal around the back-reference architecture instead
of removing wrapper-object metadata getters:
- PyO3 wrappers are two-mode: Standalone (modeling chain, owns a staging
bag) or Attached (collection-derived, holds Py<Instance> + id and
reads/writes the SoA store via back-reference).
- Wrapper getters .name / .subscripts / .parameters / .description are
PRESERVED; they switch from owning data to reading the store. Modeling
chain (x[0] + x[1] == 1).add_name("c") still works via the staging bag.
- Drop the "wrapper getter removal" breaking change. Series + *_df reshape
remain; the Rust-side metadata field removal remains.
Other corrections from review:
- Correct the "metadata is rarely consulted" claim to acknowledge that
rust/ommx/src/sample_set/extract.rs filters on metadata.name /
.subscripts / .parameters in internal logic. Those call sites move to
collection.metadata() accessors; behavior unchanged.
- Document Series migration breakage explicitly: s.values() (method) is
not on pandas Series; s.tolist() / list(s) replace it. Also flag
s.apply() as an attractive nuisance pointing toward *_df.
- Add a layered-views diagram showing wrapper / Series / *_df as three
surfaces over the canonical Rust SoA store.
- Note that implementation should be split into multiple PRs even though
the design doc is a single piece.
- Replace open question 7 (modeling chain) and 8 (Series dtype) with
drop_constraint / wrapper invalidation and Py<Instance> cycle questions
reflecting the back-reference architecture.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
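The Standalone / Attached split above can be sketched in a few lines of pure Python. This is illustrative only: the real wrappers are PyO3 classes backed by the Rust SoA store, and all names here (`MetadataStore`, `attach`, the staging dict) are hypothetical stand-ins.

```python
class MetadataStore:
    """Stands in for the Rust-side SoA store, keyed by constraint id."""
    def __init__(self):
        self.names = {}

class Constraint:
    def __init__(self):
        self._staging = {}   # Standalone mode: staging bag
        self._store = None   # Attached mode: back-reference target
        self._id = None

    def add_name(self, name):
        if self._store is None:
            self._staging["name"] = name        # buffered until insertion
        else:
            self._store.names[self._id] = name  # write-through to the store
        return self                             # keeps the modeling chain fluent

    @property
    def name(self):
        if self._store is None:
            return self._staging.get("name")
        return self._store.names.get(self._id)

    def attach(self, store, cid):
        """Insertion flips the wrapper to Attached and drains the staging bag."""
        self._store, self._id = store, cid
        if "name" in self._staging:
            store.names[cid] = self._staging.pop("name")

store = MetadataStore()
c = Constraint().add_name("c")   # Standalone: getter reads the staging bag
assert c.name == "c"
c.attach(store, 0)               # Attached: getter now reads the store
assert store.names[0] == "c" and c.name == "c"
```

The key property is that `c.name` has a single implementation path after attachment: it always consults the store, so any other Attached wrapper pointing at id 0 observes the same value.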
- Open Q1 recommendation flips to a single method with `Literal["regular","indicator","one_hot","sos1"]` + `@overload`, since pyo3-stub-gen supports emitting `typing.overload` stubs keyed on `Literal` arguments. The IDE / type-checker still sees kind-specific column schemas without four explicit method names.
- Add a new "Avoiding cross-ID-space joins" subsection to the Python SDK design. Each constraint kind has its own ID space (and DV IDs are a fifth), so a naked `df.join()` between mismatched-kind dfs would produce a shaped-but-wrong result. Mitigations:
  1. Qualified index names: `variable_id`, `regular_constraint_id`, `indicator_constraint_id`, `one_hot_constraint_id`, `sos1_constraint_id`. Visible in `df.head()` / `df.info()` / IDE.
  2. `*_full_df(kind=...)` one-step joined helpers for the common "core + metadata in one frame" case.
  3. Wrapper-object access (`s.loc[id].name`) stays the safest path for single-id lookups.
  MultiIndex / ExtensionArray approaches considered and rejected as disproportionate to the failure mode.
- Update df example signatures to match: kind dispatched via the unified method, qualified index names visible.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
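A minimal sketch of the `Literal` + `@overload` dispatch pattern, using a plain function and invented column names (the real stubs would be emitted by pyo3-stub-gen, and the real methods return DataFrames, not column lists):

```python
from typing import Literal, overload

ConstraintKind = Literal["regular", "indicator", "one_hot", "sos1"]

# Hypothetical per-kind "core column" schemas, purely illustrative.
_CORE_COLUMNS = {
    "regular":   ["equality", "type", "used_ids"],
    "indicator": ["indicator_id", "equality", "type"],
    "one_hot":   ["variables"],
    "sos1":      ["variables", "big_m"],
}

@overload
def core_columns(kind: Literal["regular"]) -> list: ...
@overload
def core_columns(kind: Literal["indicator"]) -> list: ...
@overload
def core_columns(kind: Literal["one_hot"]) -> list: ...
@overload
def core_columns(kind: Literal["sos1"]) -> list: ...
def core_columns(kind: ConstraintKind) -> list:
    # Single runtime implementation; the @overload stubs give the type
    # checker a per-kind view without four separately named methods.
    return _CORE_COLUMNS[kind]

assert core_columns("one_hot") == ["variables"]
```

In the proposed API the per-kind return annotations would differ (kind-specific DataFrame schemas), which is exactly what the overload set lets the type checker surface.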
A. Naming consistency. Update three places that still used the old
per-kind *_metadata_df / *_parameters_df notation to the unified
constraint_metadata_df(kind=...) form (Target shape, Layered views
diagram, Breaking changes).
B. Correct misleading "*_df is core+sidecars joined" framing in the
Target shape; *_df is core only. Joining is the user's job or
*_full_df's job.
C. Add ParametricInstance to the Python-side scope: same Series
accessors / *_df family as Instance, with the parametric
`parameters` field staying on its own dedicated accessor.
D. Document Solution / SampleSet *_df naming alongside Instance —
constraint_full_df / constraint_metadata_df / etc. with stage-
appropriate column schemas. Sidecar dfs are stage-independent
(same metadata across stages).
E. Specify Standalone → Attached transition semantics on
add_constraint(c): in-place flip of the wrapper's inner enum,
ValueError on double insertion, write-through visibility across
Attached wrappers pointing at the same id, PyO3 borrow rules at
&mut boundaries.
F. Confirm modeling chain matches existing API (add_name /
add_subscripts / add_description on Constraint wrapper). Builder
semantics (m.subscripts([i,j])) clarified as set-style (replace),
matching set_subscripts on the Python side.
G. Specify *_full_df argument shape: include: tuple[Literal[...]]
with options "metadata", "parameters", "provenance",
"removed_reasons". Pivoted long-format keys get namespaced wide
prefixes (parameters.{key}, provenance.{step}.source_id, …).
H. Strengthen Open Q3 (insert_with) rationale: pure Rust callers
construct constraints in loops where the two-step form silently
loses metadata; closure-based form keeps insertion atomic.
Independent of the Python staging bag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
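Item G's namespaced-prefix pivot can be sketched without pandas: long-format rows keyed by id and parameter key are folded into wide per-id columns under a `parameters.` prefix. Row and key names here are hypothetical.

```python
# Long-format parameter rows, one row per (id, key) pair.
long_rows = [
    {"regular_constraint_id": 0, "key": "weight", "value": "1.5"},
    {"regular_constraint_id": 0, "key": "group",  "value": "a"},
    {"regular_constraint_id": 1, "key": "weight", "value": "2.0"},
]

# Pivot into wide per-id columns with a "parameters." namespace prefix,
# mirroring the parameters.{key} column scheme described above.
wide = {}
for row in long_rows:
    cols = wide.setdefault(row["regular_constraint_id"], {})
    cols[f"parameters.{row['key']}"] = row["value"]

assert wide[0] == {"parameters.weight": "1.5", "parameters.group": "a"}
assert wide[1] == {"parameters.weight": "2.0"}
```

The prefix keeps pivoted sidecar columns from colliding with core columns in the wide frame.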
Replace the *_full_df helper family with an `include=` parameter on the
core *_df methods themselves. Default include keeps v2's wide-DataFrame
shape so v2 user code only needs to add the `kind=...` argument to keep
working — df["name"], df["parameters.{key}"], etc. continue to resolve.
```python
instance.constraints_df(kind="regular")
# ≡ instance.constraints_df(kind="regular",
#                           include=("metadata", "parameters"))
# v2-equivalent wide DataFrame

instance.constraints_df(kind="regular", include=())
# core columns only — explicit opt-in to the slim form

instance.constraints_df(kind="regular",
                        include=("metadata", "parameters", "removed_reasons"))
# v2 + removed_reasons folded in
```
`include` accepts Literal["metadata","parameters","removed_reasons"].
"provenance" is intentionally excluded — variable-length chains pivot
poorly. Users who want provenance go through the long-format
constraint_provenance_df.
Same `include=` shape applies on Solution and SampleSet (stage-
appropriate core columns; sidecar dfs are stage-independent).
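What the `include=` folding amounts to is an index-aligned `join()` of the core frame with the sidecar frames. A hand-built pandas sketch (frames and column names are illustrative stand-ins, not the real schemas):

```python
import pandas as pd

# Stand-in for constraints_df(kind="regular", include=()): core columns only.
core = pd.DataFrame(
    {"equality": ["<=", "="]},
    index=pd.Index([0, 1], name="regular_constraint_id"),
)
# Stand-in for the constraint_metadata_df sidecar, same qualified index.
metadata = pd.DataFrame(
    {"name": ["cap", "bal"]},
    index=pd.Index([0, 1], name="regular_constraint_id"),
)

# include=("metadata",) is equivalent to this join: no on= column needed,
# because both frames carry the id as their index.
wide_df = core.join(metadata)
assert list(wide_df.columns) == ["equality", "name"]
assert wide_df.loc[0, "name"] == "cap"
```

Because `entries_to_dataframe` sets `id` as the index, `join()` rather than `merge(on="id")` is the natural composition primitive.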
Updates: Target shape Python paragraph, *_df → derived views section
with the new code example showing default vs. explicit include,
Layered views diagram, Avoiding cross-ID-space joins (point 2 now
covers `include=` instead of *_full_df), Breaking changes section
(default behavior preserves v2 shape), ParametricInstance note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Q2: long-format constraint_removed_reasons_df is the canonical store;
include=("removed_reasons",) covers the wide pivot when wanted. Already
reflected in the *_df family — confirmed.
Q5: switch the recommendation from "defer" to "reject". subscripts is
identity (the "tuple index" in x[i,j,k]-style expressions), not a
collection; exposing it as long format would invite treating positions
as first-class entities, which is the wrong mental model. Always serve
subscripts as a single List<Int64> per id on both the metadata df and
the wrapper getter. Strengthen the "subscripts / provenance
representation" subsection accordingly, contrasting with provenance
where steps are first-class entities and the long shape is natural.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolved decisions section captures the items that no longer need sign-off:
- Q1 kind dispatch via Literal + @overload
- Q2 removed_reason as long-format df with include= opt-in (already resolved in the previous commit; re-anchored here)
- Q4 parameters as nested FnvHashMap<ID, FnvHashMap<…>>
- Q5 subscripts long format rejected (already resolved)
- Q6 pandas stays primary, polars deferred to v3.x
- Q8 Py<Instance> lifetime is documented behavior; whether to add a weak-handle wrapper variant is left to the implementation phase

Open questions section retains:
- Q3 Rust insert_with builder
- Q7 drop_constraint / wrapper invalidation deferral

Numbering preserved from the original list for traceability with earlier review comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User direction: once added, constraints stay in either active or removed for the lifetime of the Instance. relax / restore move between the two maps; nothing drops a constraint from the collection entirely. Workflows that need to "forget" a constraint build a fresh Instance. This is a stronger statement than the previous "do not add drop_constraint in v3" — it's a permanent property of the collection model, not a deferral. Attached wrappers are always valid; wrapper getters never panic or raise invalidation errors. Q7 moves from Open questions to Resolved decisions; the only remaining Open question is Q3 (Rust-side insert_with builder). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
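The active / removed invariant can be stated as a tiny two-map sketch (names hypothetical; the real collection is the Rust-side `ConstraintCollection`): `relax` and `restore` only move entries between the maps, and nothing ever deletes an id from the collection.

```python
class ConstraintCollection:
    def __init__(self):
        self.active = {}    # id -> constraint
        self.removed = {}   # id -> (constraint, reason)

    def insert(self, cid, c):
        if cid in self.active or cid in self.removed:
            raise ValueError(f"constraint {cid} already inserted")
        self.active[cid] = c

    def relax(self, cid, reason):
        # Move active -> removed; the constraint never leaves the collection.
        self.removed[cid] = (self.active.pop(cid), reason)

    def restore(self, cid):
        # Move removed -> active.
        c, _reason = self.removed.pop(cid)
        self.active[cid] = c

col = ConstraintCollection()
col.insert(0, "x0 + x1 == 1")
col.relax(0, "presolve")
assert 0 in col.removed and 0 not in col.active
col.restore(0)
assert 0 in col.active and not col.removed
```

Because every inserted id is always present in exactly one of the two maps, Attached wrappers holding that id can never dangle, which is why the getters need no invalidation path.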
User catch: a borrow-bundling View type adds nothing beyond what the store's per-field getters already provide. Internal callers (extract.rs filters, Python wrapper getters) read fields one at a time and don't need a bundled borrow; bulk owned exchanges go through ConstraintMetadata directly. Final shape:
- ConstraintMetadataStore<ID> exposes per-field borrowing getters (name, subscripts, parameters, description, provenance) plus collect_for(id) -> ConstraintMetadata for one-shot owned reconstruction, plus setters and bulk insert/remove that exchange with the owned struct.
- ConstraintMetadata stays as the owned I/O type (same pre-v3 shape).
- No separate View struct.

Update the Access patterns section to spell out the store's API surface, and reduce Q3's "three types" enumeration to two.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
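A Python sketch of that final store surface, per-field getters with empty-value sentinels, plus `collect_for` for one-shot owned reconstruction. The real store is Rust; here a plain dict stands in for the owned `ConstraintMetadata` struct.

```python
class ConstraintMetadataStore:
    def __init__(self):
        self._names, self._subscripts = {}, {}

    # Per-field getters. Absent ids yield None / an empty sentinel,
    # so callers never see the internal "no entry" representation.
    def name(self, cid):
        return self._names.get(cid)            # None if absent

    def subscripts(self, cid):
        return self._subscripts.get(cid, ())   # empty tuple if absent

    # Bulk insert exchanges with the owned metadata (dict stand-in).
    def insert(self, cid, metadata):
        if "name" in metadata:
            self._names[cid] = metadata["name"]
        if "subscripts" in metadata:
            self._subscripts[cid] = metadata["subscripts"]

    def collect_for(self, cid):
        """One-shot owned reconstruction (ConstraintMetadata analogue)."""
        return {"name": self.name(cid), "subscripts": list(self.subscripts(cid))}

store = ConstraintMetadataStore()
store.insert(3, {"name": "cap", "subscripts": [1, 2]})
assert store.name(3) == "cap"
assert store.subscripts(99) == ()
assert store.collect_for(3) == {"name": "cap", "subscripts": [1, 2]}
```

No separate View type appears: callers either read one field at a time or take the whole owned value via `collect_for`.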
Q3 confirmed: insert_with(c, metadata: ConstraintMetadata) is adopted alongside the flat insert + setters form. The owned ConstraintMetadata struct is reused as the I/O type — no separate Input type, no closure builder. All eight original Open Questions are now Resolved decisions. The Open questions section is empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two items called out by the final design review:
- Q7's "no remove from collection" invariant has one explicit exception: parse-time hint absorption in From<v1::Instance> can drop absorbed regular constraints during parsing. That's part of the parse step, not a runtime ConstraintCollection mutation, so the runtime invariant still holds. Add a paragraph under Q7 spelling this out so the v2-proto round-trip behavior isn't mistaken for a runtime drop slipping in.
- The "Solution and SampleSet share the metadata store" guarantee requires threading ConstraintMetadataStore<T::ID> through every constructor of EvaluatedCollection<T> / SampledCollection<T> and every Evaluate impl. Add this as an explicit implementation task under "Other types affected" so it isn't left implicit.

Also note that hand-written stub overrides handle the Series[ID -> Object] return signatures for pyo3-stub-gen (no doc change needed beyond noting it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c64d2e400f
> `solution.constraint_metadata_df(kind=...)` returns the same data as
> `instance.constraint_metadata_df(kind=...)` (and the `include=` columns
> folded into `*_df` are identical across stages too). There's no per-
Define snapshot semantics for stage metadata stores
This section states that solution.constraint_metadata_df(...) is identical to instance.constraint_metadata_df(...) and that there is no stage divergence, but earlier in the proposal EvaluatedCollection metadata is described as copied/cloned from the parent collection (metadata: ... copied from parent collection, and store-level clone during evaluate). Combined with attached-wrapper write-through metadata mutation on Instance, these two statements conflict: mutating metadata after evaluation would make Solution/SampleSet sidecars diverge from Instance. Please explicitly choose and document one model (shared store vs snapshot-at-evaluate) so implementations and user expectations don’t silently drift.
Pull request overview
Adds a new design document proposing an OMMX v3 metadata storage redesign, intended to unblock the v3 proto wire-shape decision for ConstraintMetadata (linked to #841) by first settling the runtime + Python API direction.
Changes:
- Introduces an SoA (Struct-of-Arrays) metadata store design in the Rust SDK, moving metadata from per-element structs to collection-owned stores.
- Proposes a Python SDK surface redesign: collection accessors return `pandas.Series[ID -> wrapper]`, and `*_df(..., include=...)` methods become explicit derived views with long-format sidecars.
- Specifies wrapper object behavior (Standalone vs Attached) with back-references into the canonical Rust store.
```rust
pub fn name(&self, id: ID) -> Option<&str>;
pub fn subscripts(&self, id: ID) -> &[i64];                      // empty slice if absent
pub fn parameters(&self, id: ID) -> &FnvHashMap<String, String>; // &EMPTY_MAP if absent
pub fn description(&self, id: ID) -> Option<&str>;
pub fn provenance(&self, id: ID) -> &[Provenance];

// One-shot owned reconstruction.
pub fn collect_for(&self, id: ID) -> ConstraintMetadata;
```
The proposed ConstraintMetadataStore<ID> API takes id: ID by value, but the impl bound shown is only ID: Eq + Hash. Unless you want to require ID: Copy (which matches current ID newtypes), this should likely take &ID to avoid forcing moves/clones for non-Copy IDs.
```diff
-pub fn name(&self, id: ID) -> Option<&str>;
-pub fn subscripts(&self, id: ID) -> &[i64];                      // empty slice if absent
-pub fn parameters(&self, id: ID) -> &FnvHashMap<String, String>; // &EMPTY_MAP if absent
-pub fn description(&self, id: ID) -> Option<&str>;
-pub fn provenance(&self, id: ID) -> &[Provenance];
-// One-shot owned reconstruction.
-pub fn collect_for(&self, id: ID) -> ConstraintMetadata;
+pub fn name(&self, id: &ID) -> Option<&str>;
+pub fn subscripts(&self, id: &ID) -> &[i64];                      // empty slice if absent
+pub fn parameters(&self, id: &ID) -> &FnvHashMap<String, String>; // &EMPTY_MAP if absent
+pub fn description(&self, id: &ID) -> Option<&str>;
+pub fn provenance(&self, id: &ID) -> &[Provenance];
+// One-shot owned reconstruction.
+pub fn collect_for(&self, id: &ID) -> ConstraintMetadata;
```
> - **Parsing** (`From<v1::Instance> for Instance`, etc., in
>   `rust/ommx/src/instance/parse.rs` and
>   `rust/ommx/src/constraint/parse.rs`). Today each element is parsed with
>   its metadata; after the refactor, parsing emits bare elements and a
>   populated `*MetadataStore`. The natural locus is `From<v1::Instance>`
>   and `From<Vec<v1::Constraint>> for ConstraintCollection<...>`.
> - **Serialization** (`From<&Instance> for v1::Instance`,
This section describes current parse/serialize boundaries as From<v1::Instance> for Instance / From<Vec<v1::Constraint>> … / From<&Instance> for v1::Instance, but the codebase currently uses the Parse trait + TryFrom<v1::Instance> and From<Instance> for v1::Instance (e.g. rust/ommx/src/instance/parse.rs, rust/ommx/src/constraint/parse.rs). Consider updating the doc to match the existing conversion pattern so implementation tasks are unambiguous.
```diff
-- **Parsing** (`From<v1::Instance> for Instance`, etc., in
-  `rust/ommx/src/instance/parse.rs` and
-  `rust/ommx/src/constraint/parse.rs`). Today each element is parsed with
-  its metadata; after the refactor, parsing emits bare elements and a
-  populated `*MetadataStore`. The natural locus is `From<v1::Instance>`
-  and `From<Vec<v1::Constraint>> for ConstraintCollection<...>`.
-- **Serialization** (`From<&Instance> for v1::Instance`,
+- **Parsing** (the `Parse` trait + `TryFrom<v1::Instance>` / related
+  conversions in `rust/ommx/src/instance/parse.rs` and
+  `rust/ommx/src/constraint/parse.rs`). Today each element is parsed with
+  its metadata; after the refactor, parsing emits bare elements and a
+  populated `*MetadataStore`. The natural locus is the existing
+  collection-level parse entry points (`TryFrom<v1::Instance>` and the
+  constraint collection parsers), rather than per-element metadata parsing.
+- **Serialization** (`From<Instance> for v1::Instance`,
```
> A new section in `PYTHON_SDK_MIGRATION_GUIDE.md` covers the Python side
> in detail.
The doc states “A new section in PYTHON_SDK_MIGRATION_GUIDE.md covers the Python side in detail,” but that section isn’t present in the repo yet. Consider rephrasing in future tense (or linking an issue/PR) so readers don’t assume the migration guide has already been updated.
```diff
-A new section in `PYTHON_SDK_MIGRATION_GUIDE.md` covers the Python side
-in detail.
+A new section in `PYTHON_SDK_MIGRATION_GUIDE.md` will cover the Python
+side in detail.
```
```rust
pub struct EvaluatedCollection<T: ConstraintType> {
    constraints: BTreeMap<T::ID, T::Evaluated>,
    removed_reasons: BTreeMap<T::ID, RemovedReason>,
    metadata: ConstraintMetadataStore<T::ID>, // new — copied from parent collection
}
```
The design alternates between describing Solution/SampleSet metadata as a single shared source (“share one metadata source per collection”) and describing it as copied into EvaluatedCollection (“copied from parent collection”). Please clarify whether evaluated/sampled views hold a shared reference (so later metadata edits are reflected) or a snapshot clone (and document the staleness semantics).
```rust
// Per-field borrows. Static EMPTY_* sentinels cover the absent
// case so the Option<FnvHashMap<…>> storage doesn't leak through
// the public API.
```
The comment about “Option<FnvHashMap<…>> storage” leaking through the API seems inconsistent with the proposed store layout (which uses FnvHashMap<ID, …> fields rather than Option<…>). Suggest rewording to describe the actual “missing entry” case and how the EMPTY_* sentinels are used.
```diff
-// Per-field borrows. Static EMPTY_* sentinels cover the absent
-// case so the Option<FnvHashMap<…>> storage doesn't leak through
-// the public API.
+// Per-field borrows. If an ID has no entry in a field map, the
+// getter returns None or a static EMPTY_* sentinel as appropriate,
+// so missing-entry handling does not leak through the public API.
```
Summary
The v3 proto wire shape of `ConstraintMetadata` (inline per message vs. top-level columnar map) cannot be finalized until the runtime / Python-API direction here is settled.

Target shape

- `ConstraintCollection<T>` owns `ConstraintMetadataStore<T::ID>`; `Instance` and `ParametricInstance` own `VariableMetadataStore` directly. `DecisionVariable` and `Constraint<S>` (and the special-constraint variants) lose their `metadata` field. Parse / serialize boundaries move from per-element to per-collection. Internal call sites that read `c.metadata.*` (e.g. `sample_set/extract.rs`) switch to `collection.metadata()` accessors.
- `instance.constraints`, `decision_variables`, and the special-constraint accessors become `pandas.Series[ID -> Object]` instead of `dict` / `list`.
- `*_df` methods (`constraints_df(kind=..., include=...)`, `decision_variables_df(include=...)`, and Solution / SampleSet counterparts) take an `include=` parameter selecting which sidecars (`"metadata"`, `"parameters"`, `"removed_reasons"`) to fold in. Default `include=("metadata", "parameters")` reproduces v2's wide-DataFrame shape — v2 user code keeps working with just a `kind=...` argument added.
- The long-format sidecar dfs (`constraint_metadata_df`, `constraint_parameters_df`, `constraint_provenance_df`, `constraint_removed_reasons_df`, `variable_metadata_df`, `variable_parameters_df`) are bulk-built from the SoA store. `provenance` is intentionally not included via `include=` (variable-length chains pivot poorly) and is only available as the long-format df.
- `indicator_constraints_df` / `one_hot_constraints_df` / `sos1_constraints_df` collapse into the single `constraints_df(kind=...)` overload set, dispatched via `Literal` + `@overload` so the IDE / type checker still sees kind-specific column schemas.
- PyO3 wrappers are two-mode: Standalone (modeling chain, owns a staging bag) or Attached (collection-derived, holds `Py<Instance>` + id and reads / writes the SoA store via back-reference). Getters `.name`, `.subscripts`, `.parameters`, `.description` are preserved; they switch from owning data to reading the store. The modeling chain `(x[0] + x[1] == 1).add_name("c")` keeps working through the staging bag, which drains into the SoA store on insertion.
- Qualified index names (`variable_id`, `regular_constraint_id`, `indicator_constraint_id`, `one_hot_constraint_id`, `sos1_constraint_id`) keep cross-ID-space joins visible.
The default `include=` covers most "I want a wide table" cases without manual join, removing the most common opportunity for a wrong-kind merge.

Relationship to #841
#842 first, #841 second. The proto-schema work in #841 currently sketches `ConstraintMetadata` inline per constraint message but defers the wire-shape decision (inline AoS vs. top-level `map<uint64, ConstraintMetadata>` per collection). That choice becomes concrete only after this proposal lands and the parse / serialize boundary has a clean shape.

Open questions (with recommendations)
See the bottom of the doc for full reasoning.
- Q1: kind dispatch — rec: `Literal` + `@overload`.
- Q2: `removed_reason` placement — rec: separate long-format df (and `"removed_reasons"` opt-in via `include=` for the wide form).
- Q3: Rust-side insertion — rec: `insert_with` on the Rust side (independent of the Python staging bag).
- Q4: `parameters` Rust-side storage — rec: nested `FnvHashMap<ID, FnvHashMap<…>>`.
- Q5: `subscripts_df` long format — rec: defer.
- Q6: dataframe library — rec: pandas stays primary; polars deferred to v3.x.
- Q7: `drop_constraint` / wrapper invalidation — rec: do not add `drop_constraint` in v3; defer the invalidation semantics until it's actually needed.
- Q8: `Py<Instance>` cycles — rec: documented behavior, no code-level mitigation.

Test plan
- … the `*_df(include=...)` Python API, and the v2-compatible default.
- … the `dict` → `Series` migration and the `*_df` `include=` reshape.

🤖 Generated with Claude Code