@@ -0,0 +1,4 @@
schema: spec-driven
created: 2026-05-02
created_by: che cheng <kiki830621@gmail.com>
created_with: codex
126 changes: 126 additions & 0 deletions openspec/changes/script-variant-anchor-matching/design.md
@@ -0,0 +1,126 @@
## Design

This change selects the "matcher option" path instead of changing visible text output.

### Public API Shape

OOXMLSwift should introduce a public options type for text-anchor lookup, for example:

```swift
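public struct AnchorLookupOptions: Sendable, Equatable {
    /// When true, math script variants are folded to ASCII before matching.
    public var mathScriptInsensitive: Bool

    // Explicit public init with a default, so `AnchorLookupOptions()` compiles.
    public init(mathScriptInsensitive: Bool = false) {
        self.mathScriptInsensitive = mathScriptInsensitive
    }

    public static let exact = AnchorLookupOptions()
}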
```

The existing public anchor lookup entry point, currently represented by `findBodyChildContainingText`, should accept this options value while preserving the current default:

```swift
findBodyChildContainingText(_ text: String, instance: Int = 1, options: AnchorLookupOptions = .exact)
```

Naming rationale:

- `mathScriptInsensitive` is narrower than `unicodeInsensitive`.
- `scriptVariantInsensitive` is technically accurate but easier to confuse with programming scripts at the MCP schema boundary.
- `math_script_insensitive` is explicit in JSON and matches the use case that triggered #90.

### Normalization Contract

When `mathScriptInsensitive == true`, both haystack and needle are transformed by the same canonicalization helper before searching.

The helper should map Unicode subscript and superscript variants to the closest ASCII representation:

- Subscript/superscript digits `₀..₉`, `⁰..⁹` -> `0..9`
- Script signs and grouping characters: `₊`/`⁺` -> `+`, `₋`/`⁻` -> `-`, `₌`/`⁼` -> `=`, `₍`/`⁽` -> `(`, `₎`/`⁾` -> `)`
- Common Unicode subscript/superscript Latin letters that have clear compatibility forms -> their ASCII letters

The mapping should be explicit and test-pinned. Characters outside the table are preserved unchanged. This avoids turning the feature into broad Unicode folding.

Because insertion anchors only need a body-child location, the first implementation does not need a normalized-span to original-span map. If a future exact-span mutator reuses the helper, that follow-up must define span mapping separately.
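
A minimal sketch of such a table-driven helper, assuming the mapping is applied per `Character`. The names `mathScriptTable` and `canonicalizeMathScript` are hypothetical, and the letter entries are only a sample; the pinned table is for the implementation to define.

```swift
// Hypothetical table: digits, signs, grouping characters, and a sample of letters.
let mathScriptTable: [Character: Character] = {
    var table: [Character: Character] = [:]
    let subDigits = Array("₀₁₂₃₄₅₆₇₈₉")
    let supDigits = Array("⁰¹²³⁴⁵⁶⁷⁸⁹")
    for (i, ascii) in Array("0123456789").enumerated() {
        table[subDigits[i]] = ascii
        table[supDigits[i]] = ascii
    }
    let pairs: [(Character, Character)] = [
        ("₊", "+"), ("⁺", "+"), ("₋", "-"), ("⁻", "-"), ("₌", "="), ("⁼", "="),
        ("₍", "("), ("⁽", "("), ("₎", ")"), ("⁾", ")"),
        // Sample letters only (assumption); the real set must be test-pinned.
        ("ᵢ", "i"), ("ⁱ", "i"), ("ₙ", "n"), ("ⁿ", "n"),
    ]
    for (from, to) in pairs { table[from] = to }
    return table
}()

/// Maps every listed character to ASCII and leaves everything else unchanged.
func canonicalizeMathScript(_ text: String) -> String {
    String(text.map { mathScriptTable[$0] ?? $0 })
}

let normalized = canonicalizeMathScript("xᵢ + y²")  // "xi + y2"
```

Because the lookup falls through to the original character, anything outside the table (for example `α` or `→`) passes through untouched, which keeps the helper from becoming general Unicode folding.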

### Matching Semantics

The option is bidirectional:

- Needle `H₀` matches haystack `H0`
- Needle `H0` matches haystack `H₀`
- Needle `xᵢ` matches haystack `xi`
- Needle `xi` matches haystack `xᵢ`

`instance` / `text_instance` semantics are applied after normalization. In other words, the nth match is counted in the normalized text universe.

Default exact matching remains unchanged, so existing tests for literal matching should continue to pass.
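
These rules can be sketched over plain strings. The sketch assumes `instance` counts matching body children rather than individual occurrences, which is how the spec scenarios read. `fold`, `containsRun`, and `bodyChildIndex` are illustrative names, not the OOXMLSwift API, and `fold` covers only the characters used here.

```swift
// Hypothetical stand-in for the shared canonicalization helper (tiny table).
func fold(_ text: String) -> String {
    let table: [Character: Character] = ["₀": "0", "ᵢ": "i", "²": "2"]
    return String(text.map { table[$0] ?? $0 })
}

/// True when `needle` occurs in `haystack` as a contiguous run of Characters.
func containsRun(_ haystack: [Character], _ needle: [Character]) -> Bool {
    guard !needle.isEmpty, needle.count <= haystack.count else { return false }
    for start in 0...(haystack.count - needle.count) {
        if haystack[start..<(start + needle.count)].elementsEqual(needle) { return true }
    }
    return false
}

/// Returns the index of the nth body child whose text contains `needle`.
/// With the option enabled, both sides are folded before comparing and counting.
func bodyChildIndex(in flattened: [String], containing needle: String,
                    instance: Int = 1, mathScriptInsensitive: Bool = false) -> Int? {
    let normalize: (String) -> String = { mathScriptInsensitive ? fold($0) : $0 }
    let target = Array(normalize(needle))
    var seen = 0
    for (index, text) in flattened.enumerated() {
        if containsRun(Array(normalize(text)), target) {
            seen += 1
            if seen == instance { return index }
        }
    }
    return nil
}

let paragraphs = ["H0 first", "H₀ second"]
let exactHit = bodyChildIndex(in: paragraphs, containing: "H₀")  // 1 (literal match only)
let secondHit = bodyChildIndex(in: paragraphs, containing: "H₀",
                               instance: 2, mathScriptInsensitive: true)  // 1
```

Note that in exact mode `H₀` still finds the second paragraph as its first match, while the enabled mode counts the first paragraph too, so the same `instance` value can resolve to a different body child once the option is on.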

### MCP Schema Shape

Use a nested `match_options` object rather than one-off booleans on every insertion tool.

```json
{
  "type": "object",
  "properties": {
    "match_options": {
      "type": "object",
      "properties": {
        "math_script_insensitive": {
          "type": "boolean",
          "default": false
        }
      }
    }
  }
}
```

Reasons:

- Leaves room for later exact, case, diacritic, or whitespace options without adding flat parameter clutter.
- Makes it clear the option modifies anchor matching rather than insertion behavior.
- Lets all anchor-based tools share identical schema text and parser code.

The parser should treat omitted `match_options` and omitted `math_script_insensitive` as `false`.
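
A sketch of that parsing rule, using a hypothetical `JSONValue` enum in place of whatever value type the MCP layer actually uses. `MatchOptionsError` and `parseMathScriptInsensitive` are also illustrative names. The sketch rejects wrong types explicitly here, whereas the real server may leave that to its existing schema validation.

```swift
// Hypothetical JSON value type standing in for the MCP layer's own.
enum JSONValue: Equatable {
    case null
    case bool(Bool)
    case number(Double)
    case string(String)
    case array([JSONValue])
    case object([String: JSONValue])
}

enum MatchOptionsError: Error, Equatable {
    case matchOptionsNotAnObject
    case fieldNotABoolean(String)
}

/// Returns the flag that would feed `AnchorLookupOptions.mathScriptInsensitive`.
/// An omitted object or omitted field means `false`; wrong types are rejected
/// instead of being coerced into an enabled option.
func parseMathScriptInsensitive(from arguments: [String: JSONValue]) throws -> Bool {
    guard let raw = arguments["match_options"] else { return false }
    guard case .object(let options) = raw else {
        throw MatchOptionsError.matchOptionsNotAnObject
    }
    guard let value = options["math_script_insensitive"] else { return false }
    guard case .bool(let flag) = value else {
        throw MatchOptionsError.fieldNotABoolean("math_script_insensitive")
    }
    return flag
}
```

Keeping this in one function lets every anchor-based tool share the same defaulting and rejection behavior, rather than each tool reading the flag on its own.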

### Tool Scope

Initial che-word-mcp scope:

- `insert_paragraph` `after_text` / `before_text`
- `insert_equation` display mode `after_text` / `before_text`
- `insert_image_from_path` `after_text` / `before_text`
- `insert_caption` `after_text` / `before_text`

Inline `insert_equation` anchor behavior remains governed by the existing inline-mode contract and is not expanded here.

### Backward Compatibility

This is a non-breaking additive option:

- Existing requests without `match_options` keep exact matching.
- Existing visible text output remains unchanged.
- Existing schema fields remain valid.
- Existing callers that already pass `H0` keep working.

### Test Strategy

OOXMLSwift tests:

- Normalization helper maps pinned subscript/superscript characters to ASCII.
- Exact mode does not match `H₀` against `H0`.
- `mathScriptInsensitive` mode matches `H₀` against `H0` and `H0` against `H₀`.
- `instance` selection counts normalized matches correctly when multiple candidates exist.
- Unsupported characters are preserved.

che-word-mcp tests:

- `tools/list` schema exposes `match_options.math_script_insensitive` on the 4 insertion-anchor tools.
- Omitted `match_options` keeps exact matching.
- Enabled option threads through Direct Mode and Session Mode insertion calls.
- Tool result/error wording remains attributable to the original tool.

### Implementation Notes

Keep the normalization helper small and shared. Avoid scattering per-tool string replacement logic in che-word-mcp; MCP should only parse JSON into the library option.

If implementation discovers that `findBodyChildContainingText` is not the single shared anchor lookup path, first consolidate only the minimal anchor path needed for the 4 scoped tools. Do not broaden the change into a search/replace rewrite without a follow-up proposal.
59 changes: 59 additions & 0 deletions openspec/changes/script-variant-anchor-matching/proposal.md
@@ -0,0 +1,59 @@
## Problem

PsychQuant/che-word-mcp#90 exposed a mismatch between user-facing math notation and the current OOXML text universe used for insertion anchors.

The concrete symptom: a user writes `H_0` as Unicode subscript text (`H₀`), and that anchor does not match the flattened OMML visible text emitted by the library (`H0`). The two strings mean the same thing in thesis/advisor review workflows, but the current `String.contains`-style lookup treats them as unrelated.

This is not only a che-word-mcp schema issue. The root behavior belongs in the shared OOXML anchor lookup layer so MCP insertion tools, direct OOXMLSwift callers, and future search flows do not each invent their own Unicode workaround.

## Root Cause

`MathSubSuperScript.visibleText` intentionally emits a plain ASCII mirror such as `H0` for `H₀`. That output is useful for stable text extraction and should not be changed casually.

The anchor lookup path then compares caller-provided text directly against the flattened display text. Because neither side is normalized, these equivalent math-script variants fail bidirectionally:

- User anchor `H₀` vs flattened text `H0`
- User anchor `H0` vs document text that already contains Unicode subscript `H₀`

The bug is therefore a missing matching-mode contract, not a reason to change the default visible-text representation.
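
The mismatch is easy to reproduce with plain strings. This is a sketch, assuming the lookup behaves like `String.contains`; the sample strings are illustrative.

```swift
import Foundation

let flattened = "H0 is rejected"    // ASCII mirror, as emitted for an OMML subscript
let unicodeText = "H₀ is rejected"  // prose that already contains U+2080

// Swift string comparison honors canonical equivalence, but U+2080 is only
// compatibility-equivalent to "0", so neither direction matches.
let unicodeNeedleFound = flattened.contains("H₀")  // false
let asciiNeedleFound = unicodeText.contains("H0")  // false
```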

## Proposed Solution

Add an opt-in, bidirectional math-script-variant matching mode.

1. Add an OOXMLSwift anchor lookup option, tentatively named `mathScriptInsensitive`.
2. Keep default matching byte/string-exact so existing callers see no behavior change.
3. When the option is enabled, normalize both haystack and needle into a canonical ASCII math-script form before matching.
4. Preserve `MathSubSuperScript.visibleText` output and all read/export defaults.
5. Surface the option through che-word-mcp insertion anchor tools as a future-proof `match_options` object:

```json
{
  "after_text": "H₀",
  "match_options": {
    "math_script_insensitive": true
  }
}
```

The initial MCP scope is the insertion-anchor family that resolves `after_text` / `before_text`: `insert_paragraph`, `insert_equation` display mode, `insert_image_from_path`, and `insert_caption`.

## Non-Goals

- Do not change `MathSubSuperScript.visibleText` output.
- Do not change default anchor lookup behavior.
- Do not introduce broad Unicode normalization such as NFC/NFD folding in this change.
- Do not introduce approximate/fuzzy text matching.
- Do not change render/page-layout behavior.
- Do not redesign paragraph index semantics.
- Do not bundle unrelated `replace_text` behavior unless it already consumes the same shared anchor lookup API during implementation.

## Stakes

For thesis/advisor review workflows, equations are often referenced in natural prose as `H₀`, `αᵢ`, or similar notation. If agents cannot anchor insertions near those expressions, comment insertion and advisor-response workflows become brittle exactly where precision matters most.

The main risk is silent over-matching. Making the mode explicit, default-off, and limited to math script variants keeps the contract reviewable and avoids surprising callers that depend on exact anchors.

## Issue Link

Refs PsychQuant/che-word-mcp#90.
@@ -0,0 +1,67 @@
# che-word-mcp-insertion-tools Specification — script-variant-anchor-matching delta

## ADDED Requirements

### Requirement: Insertion text anchors expose math-script-insensitive matching as an opt-in match option

The che-word-mcp insertion tools that accept text anchors SHALL expose a `match_options` object with a boolean `math_script_insensitive` field. Omitted `match_options` and omitted `math_script_insensitive` SHALL behave as `false`.

The option SHALL apply only to text-anchor resolution for `after_text` and `before_text`. It SHALL NOT change paragraph-index anchors, table-cell anchors, image/table anchors, inserted content, or success response wording.

The initial scoped tools are:

| Tool | Scoped anchors |
|---|---|
| `insert_paragraph` | `after_text`, `before_text` |
| `insert_equation` display mode | `after_text`, `before_text` |
| `insert_image_from_path` | `after_text`, `before_text` |
| `insert_caption` | `after_text`, `before_text` |

#### Scenario: Schema exposes match_options on insertion tools

- **WHEN** `tools/list` is requested
- **THEN** each scoped insertion tool schema includes `match_options.math_script_insensitive`
- **AND** the schema describes the default as exact matching

#### Scenario: Omitted match_options keeps exact matching

- **GIVEN** a document paragraph whose flattened text contains `"H0"`
- **WHEN** `insert_paragraph({ doc_id, text: "note", after_text: "H₀" })` is called without `match_options`
- **THEN** the tool follows the existing exact matching behavior and reports text not found

#### Scenario: Enabled option matches Unicode subscript anchor to ASCII flattened text

- **GIVEN** a document paragraph whose flattened text contains `"H0"`
- **WHEN** `insert_paragraph({ doc_id, text: "note", after_text: "H₀", match_options: { math_script_insensitive: true } })` is called
- **THEN** the tool inserts after the matched paragraph

#### Scenario: Enabled option matches ASCII anchor to Unicode subscript text

- **GIVEN** a document paragraph whose flattened text contains `"H₀"`
- **WHEN** `insert_caption({ doc_id, label: "Equation", caption_text: "Null hypothesis", after_text: "H0", match_options: { math_script_insensitive: true } })` is called
- **THEN** the caption is inserted after the matched paragraph

#### Scenario: Option does not affect non-text anchors

- **WHEN** `insert_image_from_path({ doc_id, path, index: 0, match_options: { math_script_insensitive: true } })` is called
- **THEN** the `index` anchor behavior is unchanged

---

### Requirement: MCP match_options MUST thread through the shared OOXML anchor lookup option

che-word-mcp SHALL parse `match_options.math_script_insensitive` into the shared OOXMLSwift anchor lookup option rather than implementing per-tool Unicode replacement in the MCP layer.

Direct Mode and Session Mode SHALL use the same parser and matching semantics.

#### Scenario: Direct Mode and Session Mode agree

- **GIVEN** the same document content and the same `after_text: "H₀"` request with `match_options.math_script_insensitive: true`
- **WHEN** the insertion is run once with `source_path` Direct Mode and once with `doc_id` Session Mode
- **THEN** both modes resolve the same target paragraph

#### Scenario: Invalid match_options type is rejected by existing schema validation

- **WHEN** a caller passes `match_options` as a non-object value
- **THEN** the request is rejected consistently with existing MCP schema/type validation behavior
- **AND** the server does not silently reinterpret the invalid value as enabled matching
@@ -0,0 +1,69 @@
# ooxml-paragraph-text-mirror Specification — script-variant-anchor-matching delta

## ADDED Requirements

### Requirement: Anchor lookup MAY normalize math script variants when explicitly requested

OOXMLSwift paragraph/body-child text anchor lookup SHALL support an explicit option for math-script-insensitive matching. The default lookup mode SHALL remain exact matching.

When math-script-insensitive matching is enabled, the lookup implementation SHALL normalize both the searched document text (haystack) and the caller-provided anchor text (needle) through the same math-script canonicalization table before matching.

The canonicalization table SHALL map supported Unicode subscript and superscript digits, signs, grouping characters, and common Latin letters to their closest ASCII representation. Characters not listed in the table SHALL remain unchanged.

The option MUST NOT change `Paragraph.flattenedDisplayText()`, OMML `visibleText`, export text output, or any other read API output.

#### Scenario: Default exact mode preserves current mismatch

- **GIVEN** a paragraph whose flattened display text is `"H0 is rejected"`
- **WHEN** anchor lookup searches for `"H₀"` with default options
- **THEN** no match is returned

#### Scenario: Unicode subscript needle matches ASCII haystack when enabled

- **GIVEN** a paragraph whose flattened display text is `"H0 is rejected"`
- **WHEN** anchor lookup searches for `"H₀"` with math-script-insensitive matching enabled
- **THEN** the paragraph's body-child index is returned

#### Scenario: ASCII needle matches Unicode subscript haystack when enabled

- **GIVEN** a paragraph whose flattened display text is `"H₀ is rejected"`
- **WHEN** anchor lookup searches for `"H0"` with math-script-insensitive matching enabled
- **THEN** the paragraph's body-child index is returned

#### Scenario: Superscript and subscript letters normalize consistently

- **GIVEN** a paragraph whose flattened display text is `"xᵢ + y²"`
- **WHEN** anchor lookup searches for `"xi + y2"` with math-script-insensitive matching enabled
- **THEN** the paragraph's body-child index is returned

#### Scenario: nth-instance selection counts normalized matches

- **GIVEN** body paragraphs flatten to `["H0 first", "H₀ second"]`
- **WHEN** anchor lookup searches for `"H₀"` with instance `2` and math-script-insensitive matching enabled
- **THEN** the second paragraph's body-child index is returned

#### Scenario: Unsupported characters are preserved

- **GIVEN** a paragraph containing an unsupported Unicode symbol that is not in the math-script canonicalization table
- **WHEN** math-script-insensitive lookup runs
- **THEN** that symbol remains unchanged for matching purposes

---

### Requirement: Math visible text output remains unchanged by script-variant matching

Math-script-insensitive matching SHALL be implemented as a lookup-time normalization mode only. It SHALL NOT alter OMML visible-text generation or paragraph flattening output.

#### Scenario: OMML subscript visible text remains ASCII

- **GIVEN** an OMML subscript expression representing `H₀`
- **WHEN** its visible text is read
- **THEN** the output remains the existing ASCII mirror, such as `"H0"`
- **AND** no Unicode subscript character is introduced by the matcher option

#### Scenario: flattenedDisplayText is unaffected

- **GIVEN** a paragraph containing the OMML expression for `H₀`
- **WHEN** `flattenedDisplayText()` is called
- **THEN** the result remains identical to exact-mode behavior
- **AND** enabling math-script-insensitive lookup elsewhere does not mutate the paragraph or its flattened output
12 changes: 12 additions & 0 deletions openspec/changes/script-variant-anchor-matching/tasks.md
@@ -0,0 +1,12 @@
## Tasks

- [x] Add OOXMLSwift `AnchorLookupOptions` or equivalent public options type with exact matching as the default.
- [x] Implement a shared math-script canonicalization helper with explicit mapping for supported Unicode subscript/superscript characters.
- [x] Thread the options type through `findBodyChildContainingText` and the insertion-anchor call sites that depend on it.
- [x] Add OOXMLSwift tests for exact mode, math-script-insensitive mode, bidirectional matching, nth-instance behavior, and unsupported-character preservation.
- [x] Add che-word-mcp `match_options.math_script_insensitive` schema support to the 4 scoped insertion tools.
- [x] Parse MCP `match_options` into the OOXMLSwift anchor lookup option for the scoped insertion tools. Current insertion mutators are Session Mode only; no `source_path` mutating Direct Mode surface exists in this change.
- [x] Add che-word-mcp tests proving schema exposure, default exact behavior, and successful `H₀` / `H0` insertion anchors when the option is enabled.
- [x] Update README/tool documentation with the exact matching default and opt-in math-script matching examples.
- [x] Verify with targeted Swift tests in both affected repos.
- [x] Open implementation PRs referencing PsychQuant/che-word-mcp#90.