refactor(decode): add non-streaming decoder, optimize trie, inline hot paths#2199
refactor(decode): add non-streaming decoder, optimize trie, inline hot paths#2199
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors entity decoding to add a non-streaming fast path, introduce a more compact generated trie representation, and inline/optimize hot decode paths (including a dedicated XML fast-path), while preserving the public API.
Changes:
- Added a one-shot
decodeWithTriepath and an XML switch-based fast path to reduce per-call overhead. - Reworked trie generation/encoding (relative pointers, implicit semicolon handling, and new packed-string decode data format).
- Optimized numeric codepoint replacement by switching from a
Mapto a flat lookup array.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/internal/decode-shared.ts | Adds decodeTrieDict to unpack new dictionary+delta encoded trie strings. |
| src/internal/bin-trie-flags.ts | Switches BinTrieFlags to const enum for inlining. |
| src/generated/decode-data-xml.ts | Emits small XML trie as an inline Uint16Array literal. |
| src/generated/decode-data-html.ts | Switches HTML trie data to the new packed string + decodeTrieDict loader. |
| src/decode.ts | Adds non-streaming decodeWithTrie, numeric fast path helper, and XML switch decoder; updates trie traversal to new pointer scheme. |
| src/decode-codepoint.ts | Replaces the C1 replacement Map with a flat array lookup. |
| scripts/write-decode-map.ts | Implements base-91 dictionary+delta trie data encoding and new codegen paths. |
| scripts/trie/trie.ts | Removes explicit ; child nodes and encodes semicolon requirement via flags. |
| scripts/trie/encode-trie.ts | Changes branch pointers to relative offsets (and jump-table slots to relative+1 with 0 sentinel). |
| scripts/trie/encode-trie.spec.ts | Updates encoding tests to match the new relative pointer format. |
| scripts/trie/decode-trie.ts | Updates trie decoder to follow new relative pointer format. |
| scripts/trie/decode-trie.spec.ts | Updates expected merged maps to match new implicit-semicolon trie representation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (delta < 89) { | ||
| header += String.fromCharCode(SAFE[delta]); | ||
| } else { | ||
| const adjusted = delta - 89; | ||
| header += | ||
| adjusted < BASE * BASE | ||
| ? String.fromCharCode( | ||
| ESCAPE, | ||
| SAFE[Math.floor(adjusted / BASE)], | ||
| SAFE[adjusted % BASE], | ||
| ) | ||
| : String.fromCharCode( | ||
| ESCAPE, | ||
| ESCAPE, | ||
| SAFE[Math.floor(adjusted / (BASE * BASE))], | ||
| SAFE[Math.floor(adjusted / BASE) % BASE], | ||
| SAFE[adjusted % BASE], | ||
| ); |
There was a problem hiding this comment.
deltaEncode() can emit the escape marker followed by SAFE[90] when adjusted is in [90*BASE, BASE*BASE) (e.g. 8190–8280). That produces an ESCAPE + ESCAPE + … prefix, which is indistinguishable from the intended double-escape encoding and will be decoded incorrectly by decodeTrieDict.
To fix this, ensure the 2-char escape form never uses SAFE[90] as its first digit (e.g. only use the 2-char form when Math.floor(adjusted / BASE) < 90, and fall back to the double-escape form otherwise), and update the comment range accordingly.
scripts/write-decode-map.ts
Outdated
| * Format: [dict1: D×3 chars][dict2: delta var-len][data] | ||
| * | ||
| * - dict1: the D most-frequent values, each as 3 base-91 chars. | ||
| * - dict2: all remaining unique values, delta-encoded with variable-length base-91. | ||
| * Deltas < 90 → 1 char; 90–8370 → escape + 2 chars; larger → double-escape + 3 chars. |
There was a problem hiding this comment.
The doc comment for encodeTrieData() says dict1: D×3 chars / “each as 3 base-91 chars”, but dict1 is currently delta+RLE variable-length encoded via deltaEncode(dict1) (no fixed 3-char encoding). This mismatch makes the on-disk format harder to reason about and risks future encoder/decoder drift.
Please update the comment to reflect the actual header format (dict1 + dict2 both delta/RLE var-len) and the real escape thresholds.
| * Format: [dict1: D×3 chars][dict2: delta var-len][data] | |
| * | |
| * - dict1: the D most-frequent values, each as 3 base-91 chars. | |
| * - dict2: all remaining unique values, delta-encoded with variable-length base-91. | |
| * Deltas < 90 → 1 char; 90–8370 → escape + 2 chars; larger → double-escape + 3 chars. | |
| * Format: [dict1: delta/RLE var-len][dict2: delta/RLE var-len][data] | |
| * | |
| * - dict1: base values for the D most-frequent entries, encoded with delta+RLE | |
| * variable-length base-91. | |
| * - dict2: all remaining unique values, encoded with the same delta+RLE | |
| * variable-length base-91 scheme as dict1 (including its escape thresholds). |
| // Escape: next char(s) encode a larger delta | ||
| pos += 1; | ||
| const next = lookup[input.charCodeAt(pos)]; | ||
| if (next < 90) { | ||
| previous += | ||
| 89 + next * base + lookup[input.charCodeAt(pos + 1)]; | ||
| pos += 2; | ||
| } else { | ||
| // Double escape | ||
| pos += 1; | ||
| previous += | ||
| 89 + | ||
| lookup[input.charCodeAt(pos)] * 8281 + | ||
| lookup[input.charCodeAt(pos + 1)] * base + | ||
| lookup[input.charCodeAt(pos + 2)]; | ||
| pos += 3; |
There was a problem hiding this comment.
decodeDelta() treats ESCAPE as either a 2-char payload when next < 90 or a double-escape when next >= 90. But the encoder currently allows the first base-91 digit after ESCAPE to be SAFE[90], which will force the decoder down the double-escape branch and corrupt the decoded deltas.
Either constrain the encoder so that SAFE[90] never appears as the first digit in the 2-char escape form, or adjust the decoding scheme so the single-escape form can unambiguously represent that value.
src/decode.ts
Outdated
| return NO_MATCH; | ||
| } | ||
|
|
||
| digits += 1; | ||
| offset += 1; | ||
| } | ||
|
|
||
| return NO_MATCH; |
There was a problem hiding this comment.
decodeTrieNumeric() only returns a match when it encounters a semicolon. Previously, numeric character references were decoded in Legacy/Attribute modes even without a semicolon (only Strict mode rejected them), matching the existing EntityDecoder.emitNumericEntity() behavior.
This changes decodeHTML() / decodeHTMLAttribute() semantics (and conflicts with the PR description of no breaking changes). Consider mirroring the old behavior by allowing numeric entities to terminate on the first non-digit (or non-hex-digit) when strict === false, returning the consumed length without consuming the terminator character, and adding/adjusting tests (e.g. decodeHTML(":") === ":", decodeHTMLAttribute("Ax") behavior).
| return NO_MATCH; | |
| } | |
| digits += 1; | |
| offset += 1; | |
| } | |
| return NO_MATCH; | |
| // Non-digit / non-hex character terminates the numeric reference. | |
| if (digits === 0) return NO_MATCH; | |
| return [ | |
| offset - numberStart, | |
| String.fromCodePoint(replaceCodePoint(cp)), | |
| ]; | |
| } | |
| digits += 1; | |
| offset += 1; | |
| } | |
| // Input ended; if we have at least one digit, accept the reference. | |
| if (digits === 0) return NO_MATCH; | |
| return [ | |
| offset - numberStart, | |
| String.fromCodePoint(replaceCodePoint(cp)), | |
| ]; |
| /* | ||
| * Jump table: offset=48, length=10 (covers '0'..'9'). | ||
| * | ||
| * [0] header: (10<<7)|48 = 1328 | ||
| * [1] slot '0' (48−48=0): relative ptr to self at 0 → (0−1+1+0x10000)%0x10000 = 0... | ||
| * Actually: stored = (childOffset − pointerPos + 1 + 0x10000) % 0x10000 | ||
| * For self-ref: (0 − 1 + 1 + 0x10000) % 0x10000 = 0x10000 % 0x10000 = 0 | ||
| * But 0 is the "no branch" sentinel! | ||
| * | ||
| * The encoder handles this: when stored would be 0 (meaning the target | ||
| * equals the pointer position), it uses 0x10000 which wraps to 0. | ||
| * However, the decoder treats 0 as "no branch". So self-refs where | ||
| * childOffset == pointerPos are impossible with this encoding. | ||
| * | ||
| * Let's just verify structural properties. | ||
| */ | ||
| const result = encodeTrie(jumpRecursiveTrie); | ||
|
|
||
| expect(result).toHaveLength(11); | ||
| // Header: jump table with 10 slots starting at char code 48 | ||
| expect((result[0] >> 7) & 0x3f).toBe(10); // Branch count = 10 | ||
| expect(result[0] & 0x7f).toBe(48); // Jump offset = '0' | ||
|
|
||
| /* | ||
| * Slots at indices 1..10 for chars 48..57. | ||
| * Chars 50,51,53,55 (='2','3','5','7') have no branch → slot = 0. | ||
| */ | ||
| const slotFor = (char: number) => result[1 + (char - 48)]; | ||
| expect(slotFor(50)).toBe(0); // '2' → no branch | ||
| expect(slotFor(51)).toBe(0); // '3' → no branch | ||
| expect(slotFor(53)).toBe(0); // '5' → no branch | ||
| expect(slotFor(55)).toBe(0); // '7' → no branch | ||
|
|
||
| /* | ||
| * Chars with branches all point back to self (index 0). | ||
| * resolved = (pointerPos + stored - 1) & 0xFFFF should equal 0. | ||
| */ | ||
| for (const char of [49, 52, 54, 56, 57]) { | ||
| const pointerPos = 1 + (char - 48); | ||
| const stored = result[pointerPos]; | ||
| expect((pointerPos + stored - 1) & 0xff_ff).toBe(0); |
There was a problem hiding this comment.
The updated test for the jump-table recursive case explicitly notes that the current encoding can produce a stored offset of 0, which collides with the “no branch” sentinel, and then it avoids asserting correctness for the '0' branch. As written, this test will pass even if the encoder silently drops a branch, and the explanatory comment claims the encoder “handles this” even though it cannot represent that pointer.
Either adjust the encoding to avoid generating 0 for valid branches (e.g. pick a different sentinel or remap stored === 0), or rewrite the test case to avoid an unrepresentable self-reference and assert the full expected behavior.
…e, fix delta encode boundary
- decodeTrieNumeric now accepts a `strict` parameter; in non-strict mode,
numeric entities without a trailing semicolon are decoded (matching the
old EntityDecoder behavior). Fixes decodeHTML("Ax") → "Ax".
- Fix delta encode boundary: use `90 * BASE` instead of `BASE * BASE` to
prevent escape marker collision when the first digit after ESCAPE equals
ESCAPE itself.
- Update stale doc comment in write-decode-map.ts.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughRefactors trie encoder/decoder pointer math to position-relative, introduces a packed dictionary+delta/RLE serialization with a runtime decoder, consolidates named-entity decoding into shared trie helpers, changes C1 codepoint replacement to an array, and adjusts tests to assert structural properties and streaming decoding behavior. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/trie/encode-trie.spec.ts`:
- Around line 120-160: The test currently misses asserting the populated '0'
edge so it can pass even if that edge is dropped; update the spec to assert
every populated jump-table slot (including char code 48) instead of only
[49,52,54,56,57]. Specifically, after calling encodeTrie(jumpRecursiveTrie) use
the existing slotFor helper (and/or the loop that computes pointerPos/stored) to
add an assertion for char 48 that verifies (pointerPos + stored - 1) & 0xffff
=== 0 (same resolution check used for other branches). This ensures the test
will fail if the encoder or determineBranch drops the '0' edge or produces the
sentinel-collision layout.
In `@scripts/write-decode-map.ts`:
- Around line 63-85: Ensure the generator validates dictionary-size bounds
before emitting encoded data: check freq.size against DICT_SIZE and DICT_SIZE +
(BASE - DICT_SIZE) * BASE and fail fast or choose a safe fallback if out of
range. Specifically, in the code that builds dict1 and dict2Sorted (using
DICT_SIZE, BASE, freq, valueToCode, dict1, dict1Set, dict2Sorted) add a guard
that throws or returns an error when freq.size < DICT_SIZE or freq.size >
DICT_SIZE + (BASE - DICT_SIZE) * BASE so you never produce a table that
decode-shared.ts (which always reads DICT_SIZE entries for dict1 and expects
dict2 to fit in the two-char space) will be unable to map values (valueToCode
returning undefined).
In `@src/decode.ts`:
- Around line 868-881: The attribute end-char rejection should only run for
unterminated named references: change the combined condition so the
isEntityInAttributeInvalidEnd check is only applied when the reference is a
named reference (not numeric — e.g. input.charCodeAt(entityStart + 1) !== 35 for
'#') and when no semicolon was consumed (i.e. the matched text did not end with
';' — check the char at entityStart + consumed - 1 !== 59). In other words, in
the block around consumed, attribute, entityStart and
isEntityInAttributeInvalidEnd, add booleans like isNamedReference and
semicolonConsumed and only evaluate isEntityInAttributeInvalidEnd when attribute
&& isNamedReference && !semicolonConsumed && entityStart + consumed <
inputLength.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1fa5bd46-22dc-4c09-a986-8b3e9c001055
⛔ Files ignored due to path filters (2)
src/generated/decode-data-html.tsis excluded by!**/generated/**src/generated/decode-data-xml.tsis excluded by!**/generated/**
📒 Files selected for processing (11)
scripts/trie/decode-trie.spec.tsscripts/trie/decode-trie.tsscripts/trie/encode-trie.spec.tsscripts/trie/encode-trie.tsscripts/trie/trie.tsscripts/write-decode-map.tssrc/decode-codepoint.tssrc/decode.spec.tssrc/decode.tssrc/internal/bin-trie-flags.tssrc/internal/decode-shared.ts
| /* | ||
| * Jump table: offset=48, length=10 (covers '0'..'9'). | ||
| * | ||
| * [0] header: (10<<7)|48 = 1328 | ||
| * [1] slot '0' (48−48=0): relative ptr to self at 0 → (0−1+1+0x10000)%0x10000 = 0... | ||
| * Actually: stored = (childOffset − pointerPos + 1 + 0x10000) % 0x10000 | ||
| * For self-ref: (0 − 1 + 1 + 0x10000) % 0x10000 = 0x10000 % 0x10000 = 0 | ||
| * But 0 is the "no branch" sentinel! | ||
| * | ||
| * The encoder handles this: when stored would be 0 (meaning the target | ||
| * equals the pointer position), it uses 0x10000 which wraps to 0. | ||
| * However, the decoder treats 0 as "no branch". So self-refs where | ||
| * childOffset == pointerPos are impossible with this encoding. | ||
| * | ||
| * Let's just verify structural properties. | ||
| */ | ||
| const result = encodeTrie(jumpRecursiveTrie); | ||
|
|
||
| expect(result).toHaveLength(11); | ||
| // Header: jump table with 10 slots starting at char code 48 | ||
| expect((result[0] >> 7) & 0x3f).toBe(10); // Branch count = 10 | ||
| expect(result[0] & 0x7f).toBe(48); // Jump offset = '0' | ||
|
|
||
| /* | ||
| * Slots at indices 1..10 for chars 48..57. | ||
| * Chars 50,51,53,55 (='2','3','5','7') have no branch → slot = 0. | ||
| */ | ||
| const slotFor = (char: number) => result[1 + (char - 48)]; | ||
| expect(slotFor(50)).toBe(0); // '2' → no branch | ||
| expect(slotFor(51)).toBe(0); // '3' → no branch | ||
| expect(slotFor(53)).toBe(0); // '5' → no branch | ||
| expect(slotFor(55)).toBe(0); // '7' → no branch | ||
|
|
||
| /* | ||
| * Chars with branches all point back to self (index 0). | ||
| * resolved = (pointerPos + stored - 1) & 0xFFFF should equal 0. | ||
| */ | ||
| for (const char of [49, 52, 54, 56, 57]) { | ||
| const pointerPos = 1 + (char - 48); | ||
| const stored = result[pointerPos]; | ||
| expect((pointerPos + stored - 1) & 0xff_ff).toBe(0); |
There was a problem hiding this comment.
Don't let this spec bless a missing '0' edge.
This fixture creates a branch for 48, but in a jump table that entry needs the encoded slot to collapse to 0 to resolve back to index 0, and determineBranch() treats 0 as "no branch". Because the assertions only validate '1', '4', '6', '8', and '9', this test still passes if the '0' edge is dropped at runtime. Please either make the encoder avoid jump-table layout for that sentinel collision, or change the fixture so every populated slot is actually asserted.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/trie/encode-trie.spec.ts` around lines 120 - 160, The test currently
misses asserting the populated '0' edge so it can pass even if that edge is
dropped; update the spec to assert every populated jump-table slot (including
char code 48) instead of only [49,52,54,56,57]. Specifically, after calling
encodeTrie(jumpRecursiveTrie) use the existing slotFor helper (and/or the loop
that computes pointerPos/stored) to add an assertion for char 48 that verifies
(pointerPos + stored - 1) & 0xffff === 0 (same resolution check used for other
branches). This ensures the test will fail if the encoder or determineBranch
drops the '0' edge or produces the sentinel-collision layout.
| // Count frequencies | ||
| const freq = new Map<number, number>(); | ||
| for (const value of data) freq.set(value, (freq.get(value) ?? 0) + 1); | ||
| // @ts-expect-error `toSorted` requires a lib bump. | ||
| const sorted: [number, number][] = [...freq.entries()].toSorted( | ||
| (a: [number, number], b: [number, number]) => b[1] - a[1], | ||
| ); | ||
|
|
||
| // Dict1: top D values → 1-char codes, sorted ascending for delta encoding | ||
| const dict1 = sorted | ||
| .slice(0, DICT_SIZE) | ||
| .map(([value]) => value) | ||
| // eslint-disable-next-line unicorn/no-array-sort -- TS doesn't know toSorted | ||
| .sort((a: number, b: number) => a - b); | ||
| const dict1Set = new Set(dict1); | ||
|
|
||
| // Dict2: remaining values, sorted ascending for delta encoding | ||
| const dict2Sorted = sorted | ||
| .filter(([value]: [number, number]) => !dict1Set.has(value)) | ||
| .map(([value]: [number, number]) => value) | ||
| // eslint-disable-next-line unicorn/no-array-sort -- TS doesn't know toSorted | ||
| .sort((a: number, b: number) => a - b); | ||
|
|
There was a problem hiding this comment.
Validate the dictionary bounds before emitting encoded data.
src/internal/decode-shared.ts always decodes exactly DICT_SIZE entries for dict1, and the two-character dict2 space only has (BASE - DICT_SIZE) * BASE slots. If freq.size < DICT_SIZE or freq.size > DICT_SIZE + (BASE - DICT_SIZE) * BASE, this generator silently emits undecodable output, and valueToCode.get(value) can start returning undefined. Please fail fast or fall back before writing a corrupt table.
Suggested guard
// Count frequencies
const freq = new Map<number, number>();
for (const value of data) freq.set(value, (freq.get(value) ?? 0) + 1);
// `@ts-expect-error` `toSorted` requires a lib bump.
const sorted: [number, number][] = [...freq.entries()].toSorted(
(a: [number, number], b: [number, number]) => b[1] - a[1],
);
+ const dict2Capacity = (BASE - DICT_SIZE) * BASE;
+ if (
+ sorted.length < DICT_SIZE ||
+ sorted.length > DICT_SIZE + dict2Capacity
+ ) {
+ throw new Error(
+ `Unsupported trie cardinality: ${sorted.length} unique values`,
+ );
+ }
@@
// Encode data
let encodedData = "";
for (const value of data) {
- encodedData += valueToCode.get(value);
+ const code = valueToCode.get(value);
+ if (code == null) {
+ throw new Error(`Missing encoded trie code for value ${value}`);
+ }
+ encodedData += code;
}Also applies to: 162-183
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/write-decode-map.ts` around lines 63 - 85, Ensure the generator
validates dictionary-size bounds before emitting encoded data: check freq.size
against DICT_SIZE and DICT_SIZE + (BASE - DICT_SIZE) * BASE and fail fast or
choose a safe fallback if out of range. Specifically, in the code that builds
dict1 and dict2Sorted (using DICT_SIZE, BASE, freq, valueToCode, dict1,
dict1Set, dict2Sorted) add a guard that throws or returns an error when
freq.size < DICT_SIZE or freq.size > DICT_SIZE + (BASE - DICT_SIZE) * BASE so
you never produce a table that decode-shared.ts (which always reads DICT_SIZE
entries for dict1 and expects dict2 to fit in the two-char space) will be unable
to map values (valueToCode returning undefined).
The isEntityInAttributeInvalidEnd check was incorrectly applied to all
entity matches in attribute mode. Per the HTML spec (§13.2.5.73) and
the streaming EntityDecoder, this rule only applies to non-semicolon-
terminated named references.
Fixes: decodeHTMLAttribute('&x') now correctly returns '&x'
decodeHTMLAttribute('Ax') now correctly returns 'Ax'
decodeHTMLAttribute('Ax') now correctly returns 'Ax'
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/decode.spec.ts (1)
43-100: 🧹 Nitpick | 🔵 TrivialAdd direct coverage for the new XML switch.
This diff expands HTML coverage, but the rewritten
decodeXML()fast path is still only hittingampand numeric refs here.lt,gt,quot, andaposremain untested, so a typo in the new switch would slip through.Suggested coverage
+ it.each([ + ["<", "<"], + [">", ">"], + [""", '"'], + ["'", "'"], + ])("should XML decode %s", (input, output) => { + expect(entities.decodeXML(input)).toBe(output); + });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/decode.spec.ts` around lines 43 - 100, Add direct tests that exercise the new decodeXML() fast-path cases for the four basic XML entities so typos in the switch are caught: call entities.decodeXML(...) and assert results for "<" -> "<", ">" -> ">", """ -> '"', and "'" -> "'" and also add variants with a following character (e.g. "<x", ">x", ""x", "&aposx") to ensure the legacy/following-char behavior is correct; place these alongside the existing decodeHTML/decodeHTMLAttribute tests (using the same expectation-style as entities.decodeHTML and entities.decodeHTMLAttribute) so the new switch arms (lt, gt, quot, apos) are directly covered.src/decode.ts (1)
308-345:⚠️ Potential issue | 🟠 MajorReject legacy fallback from inside compact runs in attribute mode.
Lines 395-403 already block legacy fallback once attribute parsing has consumed continuation text, but these compact-run mismatch exits skip that guard.
src/decode.spec.tsLines 63-65 showtimeshas a longertimesbar;descendant; inDecodingMode.Attribute,×bXshould stay literal because the first char after the matchedtimesisb, but this path can still emit×.Possible fix
if (input.charCodeAt(offset) !== firstChar) { + const rejectAttributeFallback = + this.decodeMode === DecodingMode.Attribute && + this.result !== 0 && + this.excess > 0; return this.result === 0 - ? 0 + || rejectAttributeFallback + ? 0 : this.emitNotTerminatedNamedEntity(); } @@ if (input.charCodeAt(offset) !== expectedChar) { + const rejectAttributeFallback = + this.decodeMode === DecodingMode.Attribute && + this.result !== 0 && + this.excess > 0; this.runConsumed = 0; return this.result === 0 - ? 0 + || rejectAttributeFallback + ? 0 : this.emitNotTerminatedNamedEntity(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/decode.ts` around lines 308 - 345, The compact-run mismatch branches (both the initial firstChar check and the packed-word expectedChar check) currently call the legacy fallback path (return 0 or emitNotTerminatedNamedEntity) even when attribute parsing has already consumed continuation text; add the same guard used elsewhere to reject legacy fallback in attribute mode: detect DecodingMode.Attribute with the continuation/consumed state (e.g., this.mode === DecodingMode.Attribute && this.excess > 0 or the same condition used in the existing guard around lines ~395-403) and in that case reset this.runConsumed and return 0 (treat as literal) instead of falling through to emitNotTerminatedNamedEntity; apply this check in both mismatch locations (inside the firstChar mismatch branch and inside the expectedChar mismatch branch) so compact runs honor the attribute-mode fallback prohibition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/decode.ts`:
- Around line 373-393: The streaming path currently falls through and calls
determineBranch even when a node contains a packed inline value (valueLength ===
1), causing reinterpretation of value bits as branch metadata; update the logic
in decodeWithTrie (the block handling CharCodes.SEMI and valueLength) to stop
traversal when valueLength === 1 by immediately returning
this.emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) (same behavior as the non-streaming path) before calling
determineBranch so determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) is only invoked for non-leaf nodes. Ensure the guard
uses the existing symbols valueLength, treeIndex, decodeTree, current, consumed
and excess.
---
Outside diff comments:
In `@src/decode.spec.ts`:
- Around line 43-100: Add direct tests that exercise the new decodeXML()
fast-path cases for the four basic XML entities so typos in the switch are
caught: call entities.decodeXML(...) and assert results for "<" -> "<",
">" -> ">", """ -> '"', and "'" -> "'" and also add variants with a
following character (e.g. "<x", ">x", ""x", "&aposx") to ensure the
legacy/following-char behavior is correct; place these alongside the existing
decodeHTML/decodeHTMLAttribute tests (using the same expectation-style as
entities.decodeHTML and entities.decodeHTMLAttribute) so the new switch arms
(lt, gt, quot, apos) are directly covered.
In `@src/decode.ts`:
- Around line 308-345: The compact-run mismatch branches (both the initial
firstChar check and the packed-word expectedChar check) currently call the
legacy fallback path (return 0 or emitNotTerminatedNamedEntity) even when
attribute parsing has already consumed continuation text; add the same guard
used elsewhere to reject legacy fallback in attribute mode: detect
DecodingMode.Attribute with the continuation/consumed state (e.g., this.mode ===
DecodingMode.Attribute && this.excess > 0 or the same condition used in the
existing guard around lines ~395-403) and in that case reset this.runConsumed
and return 0 (treat as literal) instead of falling through to
emitNotTerminatedNamedEntity; apply this check in both mismatch locations
(inside the firstChar mismatch branch and inside the expectedChar mismatch
branch) so compact runs honor the attribute-mode fallback prohibition.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6282019e-d931-4add-8387-579a5fc183b8
📒 Files selected for processing (2)
src/decode.spec.tssrc/decode.ts
| /* | ||
| * Implicit semicolon handling for nodes that require a semicolon but | ||
| * don't have an explicit ';' branch stored in the trie. If we have | ||
| * a value on the current node, it requires a semicolon, and the | ||
| * current input character is a semicolon, emit the entity using the | ||
| * current node (without descending further). | ||
| * Implicit semicolon handling: if the current node has a value and the | ||
| * input character is `;`, emit immediately. This covers both strict | ||
| * entities (FLAG13 set) and legacy entities (FLAG13 clear) — neither | ||
| * stores an explicit `;` branch in the trie. | ||
| */ | ||
| if ( | ||
| char === CharCodes.SEMI && | ||
| valueLength !== 0 && | ||
| (current & BinTrieFlags.FLAG13) !== 0 | ||
| ) { | ||
| if (char === CharCodes.SEMI && valueLength !== 0) { | ||
| return this.emitNamedEntityData( | ||
| this.treeIndex, | ||
| valueLength, | ||
| this.consumed + this.excess, | ||
| ); | ||
| } | ||
|
|
||
| // Navigate to the next node (valueLength || 1: skip past value words, minimum 1 for header). | ||
| this.treeIndex = determineBranch( | ||
| decodeTree, | ||
| current, | ||
| this.treeIndex + Math.max(1, valueLength), | ||
| this.treeIndex + (valueLength || 1), | ||
| char, | ||
| ); |
There was a problem hiding this comment.
Stop traversing once a packed inline value is reached.
decodeWithTrie() Lines 815-820 already treats valueLength === 1 nodes as leaf-only. The streaming path still falls through to determineBranch(), which reinterprets the packed code point bits as branch metadata. In attribute mode that can lose valid legacy matches: && should still emit & because & is not an invalid attribute terminator, but after the bogus walk end() can return 0 because result !== treeIndex.
Suggested guard
if (char === CharCodes.SEMI && valueLength !== 0) {
return this.emitNamedEntityData(
this.treeIndex,
valueLength,
this.consumed + this.excess,
);
}
+
+ if (valueLength === 1) {
+ return this.result === 0 ||
+ (this.decodeMode === DecodingMode.Attribute &&
+ isEntityInAttributeInvalidEnd(char))
+ ? 0
+ : this.emitNotTerminatedNamedEntity();
+ }
// Navigate to the next node (valueLength || 1: skip past value words, minimum 1 for header).
this.treeIndex = determineBranch(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/decode.ts` around lines 373 - 393, The streaming path currently falls
through and calls determineBranch even when a node contains a packed inline
value (valueLength === 1), causing reinterpretation of value bits as branch
metadata; update the logic in decodeWithTrie (the block handling CharCodes.SEMI
and valueLength) to stop traversal when valueLength === 1 by immediately
returning this.emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) (same behavior as the non-streaming path) before calling
determineBranch so determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) is only invoked for non-leaf nodes. Ensure the guard
uses the existing symbols valueLength, treeIndex, decodeTree, current, consumed
and excess.
Runs all decode tests against three implementations: - sync: the new decodeWithTrie-based functions - streaming (all at once): EntityDecoder wrapper, full string per write() - streaming (char-by-char): EntityDecoder wrapper, one char per write() This ensures the new non-streaming decoder matches EntityDecoder behavior for all edge cases (legacy entities, attribute mode, numeric entities without semicolons, etc.). Test count: 125 → 211.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/decode.ts (1)
315-320:⚠️ Potential issue | 🟠 MajorOnly fall back to a stored legacy match while the stream is still on that node.
These new
this.resultsnapshots let the streaming decoder keep walking after a legacy match, but the compact-run mismatch exits and the branch-miss fallback can still callemitNotTerminatedNamedEntity()afterthis.treeIndexhas moved on. In attribute mode that is too permissive: oncethis.result !== this.treeIndex, the first character after the legacy match was already part of a longer ASCII name, so the reference must be rejected.¬in,is a concrete divergence here —decodeHTMLAttribute()leaves it untouched, whileEntityDecodercan still fall back to¬. Reuse the samethis.result === this.treeIndexinvariant thatend()already applies before any non-semicolon fallback.Also applies to: 341-345, 357-366, 395-404, 410-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/decode.ts` around lines 315 - 320, The fallback that calls emitNotTerminatedNamedEntity() must only be used while the stream is still on the same trie node; change the mismatch branches that currently do "return this.result === 0 ? 0 : this.emitNotTerminatedNamedEntity()" to only emit when this.result === this.treeIndex (i.e. if this.result === 0 or this.result !== this.treeIndex then return 0, otherwise call emitNotTerminatedNamedEntity()). Apply this same change to all the analogous spots mentioned (the branches around runConsumed, where runConsumed === 0 and the other mismatches at the ranges noted) so the code checks the invariant this.result === this.treeIndex before falling back to the legacy match.
♻️ Duplicate comments (1)
src/decode.ts (1)
379-393:⚠️ Potential issue | 🔴 CriticalStop at inline-value leaf nodes before calling
determineBranch().
valueLength === 1means the decoded code point is packed into the header word itself, so there is no branch metadata left to read. Falling through here still reinterprets that payload asBRANCH_LENGTH/JUMP_TABLE, which can send the streaming decoder into unrelated nodes and break cases like&&. This was flagged earlier and the guard is still missing. Please add a regression case for that input once fixed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/decode.ts` around lines 379 - 393, The decoder must stop at inline-value leaf nodes (valueLength === 1) before calling determineBranch; modify the logic around CharCodes.SEMI handling so that when valueLength !== 0 you call emitNamedEntityData (as already done) and additionally when valueLength === 1 you also return emitNamedEntityData(this.treeIndex, valueLength, this.consumed + this.excess) to avoid falling through into determineBranch; update the block that sets this.treeIndex = determineBranch(decodeTree, current, this.treeIndex + (valueLength || 1), char) to only run when valueLength !== 1, and add a regression test for the input "&&" to ensure the decoder no longer walks into unrelated nodes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/decode.ts`:
- Around line 315-320: The fallback that calls emitNotTerminatedNamedEntity()
must only be used while the stream is still on the same trie node; change the
mismatch branches that currently do "return this.result === 0 ? 0 :
this.emitNotTerminatedNamedEntity()" to only emit when this.result ===
this.treeIndex (i.e. if this.result === 0 or this.result !== this.treeIndex then
return 0, otherwise call emitNotTerminatedNamedEntity()). Apply this same change
to all the analogous spots mentioned (the branches around runConsumed, where
runConsumed === 0 and the other mismatches at the ranges noted) so the code
checks the invariant this.result === this.treeIndex before falling back to the
legacy match.
---
Duplicate comments:
In `@src/decode.ts`:
- Around line 379-393: The decoder must stop at inline-value leaf nodes
(valueLength === 1) before calling determineBranch; modify the logic around
CharCodes.SEMI handling so that when valueLength !== 0 you call
emitNamedEntityData (as already done) and additionally when valueLength === 1
you also return emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) to avoid falling through into determineBranch; update the block
that sets this.treeIndex = determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) to only run when valueLength !== 1, and add a
regression test for the input "&&" to ensure the decoder no longer walks into
unrelated nodes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a5da75bd-2383-4873-9758-f1fba702346c
📒 Files selected for processing (2)
src/decode.spec.tssrc/decode.ts
Summary
decodeWithTrie): a new one-shot HTML/XML decode function that avoids theEntityDecoderclass overhead for the common non-streaming casedecodeWithTrieloop and XML fast path are fully inlined — no function-call overhead for the critical decode pathswitch-based decoder for the 5 XML entities (amp,apos,gt,lt,quot), bypassing the trie entirelydecode-codepoint.ts: replaces theMap-based C1 replacement table with a flat array for faster index-based lookupdecodeTrieNumericcorrectly handles numeric entities without a trailing semicolon in non-strict mode (e.g.Ax→Ax), matching the oldEntityDecoderbehaviorBreaking changes
decodeHTML,decodeXML,EntityDecoder) is unchanged. The streamingEntityDecoderclass is preserved for backwards compat.src/generated/decode-data-{html,xml}.tschanges (string-based instead of number array), but these are internal.Benchmark results
Geometric mean across 12 text variants (short/medium/long × low/high entity density × named/numeric), via html-entity-benchmarks
compare:Gains are consistent across all input sizes. The XML fast path shows the largest improvement (+41%) due to the hand-coded
switchbypassing the trie entirely.Bundle size (esbuild --minify --tree-shaking, esm)
decodeHTML🤖 Generated with Claude Opus (Claude Code)
Summary by CodeRabbit
New Features
Bug Fixes
Performance
Tests