From 5200e5843e0229bd17a1531e63d05ccf9e44f074 Mon Sep 17 00:00:00 2001
From: che cheng
Date: Sat, 2 May 2026 06:44:07 +0800
Subject: [PATCH 1/2] docs: propose script-variant anchor matching

Refs PsychQuant/che-word-mcp#90
---
 .../.openspec.yaml | 4 +
 .../script-variant-anchor-matching/design.md | 126 ++++++++++++++++++
 .../proposal.md | 59 ++++++++
 .../che-word-mcp-insertion-tools/spec.md | 67 ++++++++++
 .../specs/ooxml-paragraph-text-mirror/spec.md | 69 ++++++++++
 .../script-variant-anchor-matching/tasks.md | 12 ++
 6 files changed, 337 insertions(+)
 create mode 100644 openspec/changes/script-variant-anchor-matching/.openspec.yaml
 create mode 100644 openspec/changes/script-variant-anchor-matching/design.md
 create mode 100644 openspec/changes/script-variant-anchor-matching/proposal.md
 create mode 100644 openspec/changes/script-variant-anchor-matching/specs/che-word-mcp-insertion-tools/spec.md
 create mode 100644 openspec/changes/script-variant-anchor-matching/specs/ooxml-paragraph-text-mirror/spec.md
 create mode 100644 openspec/changes/script-variant-anchor-matching/tasks.md

diff --git a/openspec/changes/script-variant-anchor-matching/.openspec.yaml b/openspec/changes/script-variant-anchor-matching/.openspec.yaml
new file mode 100644
index 0000000..9d6f220
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/.openspec.yaml
@@ -0,0 +1,4 @@
+schema: spec-driven
+created: 2026-05-02
+created_by: che cheng
+created_with: codex
diff --git a/openspec/changes/script-variant-anchor-matching/design.md b/openspec/changes/script-variant-anchor-matching/design.md
new file mode 100644
index 0000000..de46ca7
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/design.md
@@ -0,0 +1,126 @@
+## Design
+
+This change selects the "matcher option" path instead of changing visible text output.
+
+### Public API Shape
+
+OOXMLSwift should introduce a public options type for text-anchor lookup, for example:
+
+```swift
+public struct AnchorLookupOptions: Sendable, Equatable {
+    public var mathScriptInsensitive: Bool = false
+
+    public static let exact = AnchorLookupOptions()
+}
+```
+
+The existing public anchor lookup entry point, currently represented by `findBodyChildContainingText`, should accept this options value while preserving the current default:
+
+```swift
+findBodyChildContainingText(_ text: String, instance: Int = 1, options: AnchorLookupOptions = .exact)
+```
+
+Naming rationale:
+
+- `mathScriptInsensitive` is narrower than `unicodeInsensitive`.
+- `scriptVariantInsensitive` is technically accurate but easier to confuse with programming scripts at the MCP schema boundary.
+- `math_script_insensitive` is explicit in JSON and matches the use case that triggered #90.
+
+### Normalization Contract
+
+When `mathScriptInsensitive == true`, both haystack and needle are transformed by the same canonicalization helper before searching.
+
+The helper should map Unicode subscript and superscript variants to the closest ASCII representation:
+
+- Subscript/superscript digits `₀..₉`, `⁰..⁹` -> `0..9`
+- Script signs and grouping characters such as `₊`, `⁺`, `₋`, `⁻`, `₌`, `⁼`, `₍`, `⁽`, `₎`, `⁾` -> `+`, `-`, `=`, `(`, `)`
+- Common Unicode subscript/superscript Latin letters that have clear compatibility forms -> their ASCII letters
+
+The mapping should be explicit and test-pinned. Characters outside the table are preserved unchanged. This avoids turning the feature into broad Unicode folding.
+
+Because insertion anchors only need a body-child location, the first implementation does not need a normalized-span to original-span map. If a future exact-span mutator reuses the helper, that follow-up must define span mapping separately.
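
The normalization contract above can be sketched as a small table-driven helper. This is only an illustration under stated assumptions: the names `mathScriptTable` and `canonicalizeMathScript` are hypothetical, and the table is a partial sample rather than the pinned mapping the proposal asks for.

```swift
// Hypothetical, partial mapping table; the real table would be explicit and test-pinned.
let mathScriptTable: [Character: Character] = [
    // Subscript digits
    "₀": "0", "₁": "1", "₂": "2", "₃": "3", "₄": "4",
    "₅": "5", "₆": "6", "₇": "7", "₈": "8", "₉": "9",
    // Superscript digits
    "⁰": "0", "¹": "1", "²": "2", "³": "3", "⁴": "4",
    "⁵": "5", "⁶": "6", "⁷": "7", "⁸": "8", "⁹": "9",
    // Script signs and grouping characters
    "₊": "+", "⁺": "+", "₋": "-", "⁻": "-", "₌": "=", "⁼": "=",
    "₍": "(", "⁽": "(", "₎": ")", "⁾": ")",
    // A few Latin script letters (illustrative subset)
    "ᵢ": "i", "ⁱ": "i", "ₙ": "n", "ⁿ": "n", "ₓ": "x",
]

/// Maps every character found in the table to its ASCII form.
/// Characters outside the table are returned unchanged.
func canonicalizeMathScript(_ text: String) -> String {
    String(text.map { mathScriptTable[$0] ?? $0 })
}
```

Applying the same function to both the haystack and the needle is what makes the match bidirectional; nothing outside the table is folded, so this stays well short of general Unicode normalization.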
+
+### Matching Semantics
+
+The option is bidirectional:
+
+- Needle `H₀` matches haystack `H0`
+- Needle `H0` matches haystack `H₀`
+- Needle `xᵢ` matches haystack `xi`
+- Needle `xi` matches haystack `xᵢ`
+
+`instance` / `text_instance` semantics are applied after normalization. In other words, the nth match is counted in the normalized text universe.
+
+Default exact matching remains unchanged, so existing tests for literal matching should continue to pass.
+
+### MCP Schema Shape
+
+Use a nested `match_options` object rather than one-off booleans on every insertion tool.
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "match_options": {
+      "type": "object",
+      "properties": {
+        "math_script_insensitive": {
+          "type": "boolean",
+          "default": false
+        }
+      }
+    }
+  }
+}
+```
+
+Reasons:
+
+- Leaves room for later exact, case, diacritic, or whitespace options without adding flat parameter clutter.
+- Makes it clear the option modifies anchor matching rather than insertion behavior.
+- Lets all anchor-based tools share identical schema text and parser code.
+
+The parser should treat omitted `match_options` and omitted `math_script_insensitive` as `false`.
+
+### Tool Scope
+
+Initial che-word-mcp scope:
+
+- `insert_paragraph` `after_text` / `before_text`
+- `insert_equation` display mode `after_text` / `before_text`
+- `insert_image_from_path` `after_text` / `before_text`
+- `insert_caption` `after_text` / `before_text`
+
+Inline `insert_equation` anchor behavior remains governed by the existing inline-mode contract and is not expanded here.
+
+### Backward Compatibility
+
+This is a non-breaking additive option:
+
+- Existing requests without `match_options` keep exact matching.
+- Existing visible text output remains unchanged.
+- Existing schema fields remain valid.
+- Existing callers that already pass `H0` keep working.
+
+### Test Strategy
+
+OOXMLSwift tests:
+
+- Normalization helper maps pinned subscript/superscript characters to ASCII.
+- Exact mode does not match `H₀` against `H0`.
+- `mathScriptInsensitive` mode matches `H₀` against `H0` and `H0` against `H₀`.
+- `instance` selection counts normalized matches correctly when multiple candidates exist.
+- Unsupported characters are preserved.
+
+che-word-mcp tests:
+
+- `tools/list` schema exposes `match_options.math_script_insensitive` on the 4 insertion-anchor tools.
+- Omitted `match_options` keeps exact matching.
+- Enabled option threads through Direct Mode and Session Mode insertion calls.
+- Tool result/error wording remains attributable to the original tool.
+
+### Implementation Notes
+
+Keep the normalization helper small and shared. Avoid scattering per-tool string replacement logic in che-word-mcp; MCP should only parse JSON into the library option.
+
+If implementation discovers that `findBodyChildContainingText` is not the single shared anchor lookup path, first consolidate only the minimal anchor path needed for the 4 scoped tools. Do not broaden the change into a search/replace rewrite without a follow-up proposal.
diff --git a/openspec/changes/script-variant-anchor-matching/proposal.md b/openspec/changes/script-variant-anchor-matching/proposal.md
new file mode 100644
index 0000000..fab11b3
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/proposal.md
@@ -0,0 +1,59 @@
+## Problem
+
+PsychQuant/che-word-mcp#90 exposed a mismatch between user-facing math notation and the current OOXML text universe used for insertion anchors.
+
+The concrete symptom is `H_0` written as Unicode subscript text by a user (`H₀`) not matching the flattened OMML visible text emitted by the library (`H0`). The anchor text is semantically the same for thesis/advisor review workflows, but the current `String.contains`-style lookup treats the two strings as unrelated.
+
+This is not only a che-word-mcp schema issue. The root behavior belongs in the shared OOXML anchor lookup layer so MCP insertion tools, direct OOXMLSwift callers, and future search flows do not each invent their own Unicode workaround.
+
+## Root Cause
+
+`MathSubSuperScript.visibleText` intentionally emits a plain ASCII mirror such as `H0` for `H₀`. That output is useful for stable text extraction and should not be changed casually.
+
+The anchor lookup path then compares caller-provided text directly against the flattened display text. Because neither side is normalized, these equivalent math-script variants fail bidirectionally:
+
+- User anchor `H₀` vs flattened text `H0`
+- User anchor `H0` vs document text that already contains Unicode subscript `H₀`
+
+The bug is therefore a missing matching-mode contract, not a reason to change the default visible-text representation.
+
+## Proposed Solution
+
+Add an opt-in, bidirectional math-script-variant matching mode.
+
+1. Add an OOXMLSwift anchor lookup option, tentatively named `mathScriptInsensitive`.
+2. Keep default matching byte/string-exact so existing callers see no behavior change.
+3. When the option is enabled, normalize both haystack and needle into a canonical ASCII math-script form before matching.
+4. Preserve `MathSubSuperScript.visibleText` output and all read/export defaults.
+5. Surface the option through che-word-mcp insertion anchor tools as a future-proof `match_options` object:
+
+```json
+{
+  "after_text": "H₀",
+  "match_options": {
+    "math_script_insensitive": true
+  }
+}
+```
+
+The initial MCP scope is the insertion-anchor family that resolves `after_text` / `before_text`: `insert_paragraph`, `insert_equation` display mode, `insert_image_from_path`, and `insert_caption`.
+
+## Non-Goals
+
+- Do not change `MathSubSuperScript.visibleText` output.
+- Do not change default anchor lookup behavior.
+- Do not introduce broad Unicode normalization such as NFC/NFD folding in this change.
+- Do not make approximate/fuzzy text matching.
+- Do not change render/page-layout behavior.
+- Do not redesign paragraph index semantics.
+- Do not bundle unrelated `replace_text` behavior unless it already consumes the same shared anchor lookup API during implementation.
+
+## Stakes
+
+For thesis/advisor review workflows, equations are often referenced in natural prose as `H₀`, `αᵢ`, or similar notation. If agents cannot anchor insertions near those expressions, comment insertion and advisor-response workflows become brittle exactly where precision matters most.
+
+The main risk is silent over-matching. Making the mode explicit, default-off, and limited to math script variants keeps the contract reviewable and avoids surprising callers that depend on exact anchors.
+
+## Issue Link
+
+Refs PsychQuant/che-word-mcp#90.
diff --git a/openspec/changes/script-variant-anchor-matching/specs/che-word-mcp-insertion-tools/spec.md b/openspec/changes/script-variant-anchor-matching/specs/che-word-mcp-insertion-tools/spec.md
new file mode 100644
index 0000000..f113b63
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/specs/che-word-mcp-insertion-tools/spec.md
@@ -0,0 +1,67 @@
+# che-word-mcp-insertion-tools Specification — script-variant-anchor-matching delta
+
+## ADDED Requirements
+
+### Requirement: Insertion text anchors expose math-script-insensitive matching as an opt-in match option
+
+The che-word-mcp insertion tools that accept text anchors SHALL expose a `match_options` object with a boolean `math_script_insensitive` field. Omitted `match_options` and omitted `math_script_insensitive` SHALL behave as `false`.
+
+The option SHALL apply only to text-anchor resolution for `after_text` and `before_text`. It SHALL NOT change paragraph-index anchors, table-cell anchors, image/table anchors, inserted content, or success response wording.
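
The defaulting rule in this requirement (an omitted `match_options` or an omitted `math_script_insensitive` behaves as `false`) might be parsed along the following lines. This is a hedged sketch, not the che-word-mcp parser: `parseMatchOptions`, `MatchOptionsError`, and the local `AnchorLookupOptions` stand-in are hypothetical names, and rejecting a non-boolean flag is an assumption that goes slightly beyond what the spec text states.

```swift
// Local stand-in for the library options type; assumed shape, not the real OOXMLSwift declaration.
struct AnchorLookupOptions: Equatable {
    var mathScriptInsensitive = false
}

// Hypothetical error type used only in this sketch.
enum MatchOptionsError: Error {
    case notAnObject
    case notABoolean
}

/// Reads `match_options.math_script_insensitive` from already-decoded tool arguments.
/// Missing keys fall back to exact matching; wrong types are rejected instead of coerced.
func parseMatchOptions(_ arguments: [String: Any]) throws -> AnchorLookupOptions {
    guard let raw = arguments["match_options"] else {
        return AnchorLookupOptions()          // omitted object: exact matching
    }
    guard let object = raw as? [String: Any] else {
        throw MatchOptionsError.notAnObject   // e.g. a string or number was passed
    }
    guard let flag = object["math_script_insensitive"] else {
        return AnchorLookupOptions()          // omitted field: exact matching
    }
    guard let enabled = flag as? Bool else {
        throw MatchOptionsError.notABoolean   // assumption: no silent coercion
    }
    return AnchorLookupOptions(mathScriptInsensitive: enabled)
}
```

Keeping the MCP layer limited to this kind of translation leaves the Unicode handling in the shared library option.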
+
+The initial scoped tools are:
+
+| Tool | Scoped anchors |
+|---|---|
+| `insert_paragraph` | `after_text`, `before_text` |
+| `insert_equation` display mode | `after_text`, `before_text` |
+| `insert_image_from_path` | `after_text`, `before_text` |
+| `insert_caption` | `after_text`, `before_text` |
+
+#### Scenario: Schema exposes match_options on insertion tools
+
+- **WHEN** `tools/list` is requested
+- **THEN** each scoped insertion tool schema includes `match_options.math_script_insensitive`
+- **AND** the schema describes the default as exact matching
+
+#### Scenario: Omitted match_options keeps exact matching
+
+- **GIVEN** a document paragraph whose flattened text contains `"H0"`
+- **WHEN** `insert_paragraph({ doc_id, text: "note", after_text: "H₀" })` is called without `match_options`
+- **THEN** the tool follows the existing exact matching behavior and reports text not found
+
+#### Scenario: Enabled option matches Unicode subscript anchor to ASCII flattened text
+
+- **GIVEN** a document paragraph whose flattened text contains `"H0"`
+- **WHEN** `insert_paragraph({ doc_id, text: "note", after_text: "H₀", match_options: { math_script_insensitive: true } })` is called
+- **THEN** the tool inserts after the matched paragraph
+
+#### Scenario: Enabled option matches ASCII anchor to Unicode subscript text
+
+- **GIVEN** a document paragraph whose flattened text contains `"H₀"`
+- **WHEN** `insert_caption({ doc_id, label: "Equation", caption_text: "Null hypothesis", after_text: "H0", match_options: { math_script_insensitive: true } })` is called
+- **THEN** the caption is inserted after the matched paragraph
+
+#### Scenario: Option does not affect non-text anchors
+
+- **WHEN** `insert_image_from_path({ doc_id, path, index: 0, match_options: { math_script_insensitive: true } })` is called
+- **THEN** the `index` anchor behavior is unchanged
+
+---
+
+### Requirement: MCP match_options MUST thread through the shared OOXML anchor lookup option
+
+che-word-mcp SHALL parse `match_options.math_script_insensitive` into the shared OOXMLSwift anchor lookup option rather than implementing per-tool Unicode replacement in the MCP layer.
+
+Direct Mode and Session Mode SHALL use the same parser and matching semantics.
+
+#### Scenario: Direct Mode and Session Mode agree
+
+- **GIVEN** the same document content and the same `after_text: "H₀"` request with `match_options.math_script_insensitive: true`
+- **WHEN** the insertion is run once with `source_path` Direct Mode and once with `doc_id` Session Mode
+- **THEN** both modes resolve the same target paragraph
+
+#### Scenario: Invalid match_options type is rejected by existing schema validation
+
+- **WHEN** a caller passes `match_options` as a non-object value
+- **THEN** the request is rejected consistently with existing MCP schema/type validation behavior
+- **AND** the server does not silently reinterpret the invalid value as enabled matching
diff --git a/openspec/changes/script-variant-anchor-matching/specs/ooxml-paragraph-text-mirror/spec.md b/openspec/changes/script-variant-anchor-matching/specs/ooxml-paragraph-text-mirror/spec.md
new file mode 100644
index 0000000..e545df8
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/specs/ooxml-paragraph-text-mirror/spec.md
@@ -0,0 +1,69 @@
+# ooxml-paragraph-text-mirror Specification — script-variant-anchor-matching delta
+
+## ADDED Requirements
+
+### Requirement: Anchor lookup MAY normalize math script variants when explicitly requested
+
+OOXMLSwift paragraph/body-child text anchor lookup SHALL support an explicit option for math-script-insensitive matching. The default lookup mode SHALL remain exact matching.
+
+When math-script-insensitive matching is enabled, the lookup implementation SHALL normalize both the searched document text (haystack) and the caller-provided anchor text (needle) through the same math-script canonicalization table before matching.
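
The lookup rule in the preceding paragraph, together with nth-instance counting, can be sketched over plain strings. Everything here is an assumption made for illustration: `findChildIndex`, `canonicalize`, and the tiny `scriptTable` are hypothetical, body children are modeled as an array of flattened paragraph strings, and each child is counted at most once per lookup.

```swift
import Foundation

// Tiny illustrative subset of a canonicalization table (not the pinned table).
let scriptTable: [Character: Character] = [
    "₀": "0", "₁": "1", "₂": "2", "²": "2", "ᵢ": "i",
]

/// Maps table characters to ASCII and leaves everything else unchanged.
func canonicalize(_ text: String) -> String {
    String(text.map { scriptTable[$0] ?? $0 })
}

/// Returns the index of the `instance`-th (1-based) body child whose text contains `needle`.
/// When `mathScriptInsensitive` is true, both sides pass through the same canonicalization.
func findChildIndex(in children: [String],
                    containing needle: String,
                    instance: Int = 1,
                    mathScriptInsensitive: Bool = false) -> Int? {
    let target = mathScriptInsensitive ? canonicalize(needle) : needle
    var seen = 0
    for (index, text) in children.enumerated() {
        let haystack = mathScriptInsensitive ? canonicalize(text) : text
        if haystack.contains(target) {
            seen += 1                      // counted in the normalized text universe
            if seen == instance { return index }
        }
    }
    return nil
}
```

Because normalization happens only inside the lookup, the input strings are never modified, which mirrors the requirement that read APIs keep their current output.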
+
+The canonicalization table SHALL map supported Unicode subscript and superscript digits, signs, grouping characters, and common Latin letters to their closest ASCII representation. Characters not listed in the table SHALL remain unchanged.
+
+The option MUST NOT change `Paragraph.flattenedDisplayText()`, OMML `visibleText`, export text output, or any other read API output.
+
+#### Scenario: Default exact mode preserves current mismatch
+
+- **GIVEN** a paragraph whose flattened display text is `"H0 is rejected"`
+- **WHEN** anchor lookup searches for `"H₀"` with default options
+- **THEN** no match is returned
+
+#### Scenario: Unicode subscript needle matches ASCII haystack when enabled
+
+- **GIVEN** a paragraph whose flattened display text is `"H0 is rejected"`
+- **WHEN** anchor lookup searches for `"H₀"` with math-script-insensitive matching enabled
+- **THEN** the paragraph's body-child index is returned
+
+#### Scenario: ASCII needle matches Unicode subscript haystack when enabled
+
+- **GIVEN** a paragraph whose flattened display text is `"H₀ is rejected"`
+- **WHEN** anchor lookup searches for `"H0"` with math-script-insensitive matching enabled
+- **THEN** the paragraph's body-child index is returned
+
+#### Scenario: Superscript and subscript letters normalize consistently
+
+- **GIVEN** a paragraph whose flattened display text is `"xᵢ + y²"`
+- **WHEN** anchor lookup searches for `"xi + y2"` with math-script-insensitive matching enabled
+- **THEN** the paragraph's body-child index is returned
+
+#### Scenario: nth-instance selection counts normalized matches
+
+- **GIVEN** body paragraphs flatten to `["H0 first", "H₀ second"]`
+- **WHEN** anchor lookup searches for `"H₀"` with instance `2` and math-script-insensitive matching enabled
+- **THEN** the second paragraph's body-child index is returned
+
+#### Scenario: Unsupported characters are preserved
+
+- **GIVEN** a paragraph containing an unsupported Unicode symbol that is not in the math-script canonicalization table
+- **WHEN** math-script-insensitive lookup runs
+- **THEN** that symbol remains unchanged for matching purposes
+
+---
+
+### Requirement: Math visible text output remains unchanged by script-variant matching
+
+Math-script-insensitive matching SHALL be implemented as a lookup-time normalization mode only. It SHALL NOT alter OMML visible-text generation or paragraph flattening output.
+
+#### Scenario: OMML subscript visible text remains ASCII
+
+- **GIVEN** an OMML subscript expression representing `H₀`
+- **WHEN** its visible text is read
+- **THEN** the output remains the existing ASCII mirror, such as `"H0"`
+- **AND** no Unicode subscript character is introduced by the matcher option
+
+#### Scenario: flattenedDisplayText is unaffected
+
+- **GIVEN** a paragraph containing the OMML expression for `H₀`
+- **WHEN** `flattenedDisplayText()` is called
+- **THEN** the result remains identical to exact-mode behavior
+- **AND** enabling math-script-insensitive lookup elsewhere does not mutate the paragraph or its flattened output
diff --git a/openspec/changes/script-variant-anchor-matching/tasks.md b/openspec/changes/script-variant-anchor-matching/tasks.md
new file mode 100644
index 0000000..9622881
--- /dev/null
+++ b/openspec/changes/script-variant-anchor-matching/tasks.md
@@ -0,0 +1,12 @@
+## Tasks
+
+- [ ] Add OOXMLSwift `AnchorLookupOptions` or equivalent public options type with exact matching as the default.
+- [ ] Implement a shared math-script canonicalization helper with explicit mapping for supported Unicode subscript/superscript characters.
+- [ ] Thread the options type through `findBodyChildContainingText` and the insertion-anchor call sites that depend on it.
+- [ ] Add OOXMLSwift tests for exact mode, math-script-insensitive mode, bidirectional matching, nth-instance behavior, and unsupported-character preservation.
+- [ ] Add che-word-mcp `match_options.math_script_insensitive` schema support to the 4 scoped insertion tools.
+- [ ] Parse MCP `match_options` into the OOXMLSwift anchor lookup option in Direct Mode and Session Mode.
+- [ ] Add che-word-mcp tests proving schema exposure, default exact behavior, and successful `H₀` / `H0` insertion anchors when the option is enabled.
+- [ ] Update README/tool documentation with the exact matching default and opt-in math-script matching examples.
+- [ ] Verify with targeted Swift tests in both affected repos.
+- [ ] Open implementation PRs referencing PsychQuant/che-word-mcp#90.

From e4742f1f3858a98bc4002d3940f43c413ef2c546 Mon Sep 17 00:00:00 2001
From: che cheng
Date: Sat, 2 May 2026 07:05:36 +0800
Subject: [PATCH 2/2] docs: mark script-variant apply tasks complete

Refs PsychQuant/che-word-mcp#90
---
 .../script-variant-anchor-matching/tasks.md | 20 +++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/openspec/changes/script-variant-anchor-matching/tasks.md b/openspec/changes/script-variant-anchor-matching/tasks.md
index 9622881..a55c512 100644
--- a/openspec/changes/script-variant-anchor-matching/tasks.md
+++ b/openspec/changes/script-variant-anchor-matching/tasks.md
@@ -1,12 +1,12 @@
 ## Tasks
 
-- [ ] Add OOXMLSwift `AnchorLookupOptions` or equivalent public options type with exact matching as the default.
-- [ ] Implement a shared math-script canonicalization helper with explicit mapping for supported Unicode subscript/superscript characters.
-- [ ] Thread the options type through `findBodyChildContainingText` and the insertion-anchor call sites that depend on it.
-- [ ] Add OOXMLSwift tests for exact mode, math-script-insensitive mode, bidirectional matching, nth-instance behavior, and unsupported-character preservation.
-- [ ] Add che-word-mcp `match_options.math_script_insensitive` schema support to the 4 scoped insertion tools.
-- [ ] Parse MCP `match_options` into the OOXMLSwift anchor lookup option in Direct Mode and Session Mode.
-- [ ] Add che-word-mcp tests proving schema exposure, default exact behavior, and successful `H₀` / `H0` insertion anchors when the option is enabled.
-- [ ] Update README/tool documentation with the exact matching default and opt-in math-script matching examples.
-- [ ] Verify with targeted Swift tests in both affected repos.
-- [ ] Open implementation PRs referencing PsychQuant/che-word-mcp#90.
+- [x] Add OOXMLSwift `AnchorLookupOptions` or equivalent public options type with exact matching as the default.
+- [x] Implement a shared math-script canonicalization helper with explicit mapping for supported Unicode subscript/superscript characters.
+- [x] Thread the options type through `findBodyChildContainingText` and the insertion-anchor call sites that depend on it.
+- [x] Add OOXMLSwift tests for exact mode, math-script-insensitive mode, bidirectional matching, nth-instance behavior, and unsupported-character preservation.
+- [x] Add che-word-mcp `match_options.math_script_insensitive` schema support to the 4 scoped insertion tools.
+- [x] Parse MCP `match_options` into the OOXMLSwift anchor lookup option for the scoped insertion tools. Current insertion mutators are Session Mode only; no `source_path` mutating Direct Mode surface exists in this change.
+- [x] Add che-word-mcp tests proving schema exposure, default exact behavior, and successful `H₀` / `H0` insertion anchors when the option is enabled.
+- [x] Update README/tool documentation with the exact matching default and opt-in math-script matching examples.
+- [x] Verify with targeted Swift tests in both affected repos.
+- [x] Open implementation PRs referencing PsychQuant/che-word-mcp#90.