Skip to content

refactor(decode): add non-streaming decoder, optimize trie, inline hot paths#2199

Open
fb55 wants to merge 5 commits intomainfrom
decode-perf
Open

refactor(decode): add non-streaming decoder, optimize trie, inline hot paths#2199
fb55 wants to merge 5 commits intomainfrom
decode-perf

Conversation

@fb55
Copy link
Copy Markdown
Owner

@fb55 fb55 commented Mar 29, 2026

Summary

  • Non-streaming decoder (decodeWithTrie): a new one-shot HTML/XML decode function that avoids the EntityDecoder class overhead for the common non-streaming case
  • Trie encoding overhaul: switches from a flat array-of-numbers format to a compact string-based encoding using base-91 alphabet, dictionary compression, delta+RLE for multi-char values, and offset-based branch navigation
  • Inline hot paths: the main decodeWithTrie loop and XML fast path are fully inlined — no function-call overhead for the critical decode path
  • XML fast path: hand-coded switch-based decoder for the 5 XML entities (amp, apos, gt, lt, quot), bypassing the trie entirely
  • decode-codepoint.ts: replaces the Map-based C1 replacement table with a flat array for faster index-based lookup
  • Semicolon optimization: semicolons are no longer encoded in the trie data, saving space and a comparison per node
  • Legacy numeric entities: decodeTrieNumeric correctly handles numeric entities without a trailing semicolon in non-strict mode (e.g. &#65xAx), matching the old EntityDecoder behavior

Breaking changes

  • None — the public API (decodeHTML, decodeXML, EntityDecoder) is unchanged. The streaming EntityDecoder class is preserved for backwards compat.
  • Generated trie data format in src/generated/decode-data-{html,xml}.ts changes (string-based instead of number array), but these are internal.

Benchmark results

Geometric mean across 12 text variants (short/medium/long × low/high entity density × named/numeric), via html-entity-benchmarks compare:

Category Before (ops/s) After (ops/s) Δ
decodeNormal 697.9K 924.1K +32%
decodeStrict 703.8K 934.1K +33%
decodeXML 897.9K 1.26M +41%

Gains are consistent across all input sizes. The XML fast path shows the largest improvement (+41%) due to the hand-coded switch bypassing the trie entirely.

Bundle size (esbuild --minify --tree-shaking, esm)

Export Before After Δ
decodeHTML 37,000 B (21,715 gz) 22,545 B (14,774 gz) −39% (−32% gz)
Full bundle 51,718 B (28,091 gz) 37,263 B (21,206 gz) −28% (−25% gz)

🤖 Generated with Claude Opus (Claude Code)

Summary by CodeRabbit

  • New Features

    • Better legacy numeric reference handling (semicolon-less cases) in legacy modes; added a compact XML-only fast-path for core entities.
    • Streaming/chunked decoding supported so decoders handle incremental input consistently.
  • Bug Fixes

    • More accurate termination and attribute-end validation for named and numeric entities.
  • Performance

    • Smaller on-disk trie encodings for tiny datasets and a new packed serialization to reduce runtime decode overhead.
  • Tests

    • Expanded test coverage for legacy, strict, attribute, and streaming decoding scenarios.

Copilot AI review requested due to automatic review settings March 29, 2026 11:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors entity decoding to add a non-streaming fast path, introduce a more compact generated trie representation, and inline/optimize hot decode paths (including a dedicated XML fast-path), while preserving the public API.

Changes:

  • Added a one-shot decodeWithTrie path and an XML switch-based fast path to reduce per-call overhead.
  • Reworked trie generation/encoding (relative pointers, implicit semicolon handling, and new packed-string decode data format).
  • Optimized numeric codepoint replacement by switching from a Map to a flat lookup array.

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/internal/decode-shared.ts Adds decodeTrieDict to unpack new dictionary+delta encoded trie strings.
src/internal/bin-trie-flags.ts Switches BinTrieFlags to const enum for inlining.
src/generated/decode-data-xml.ts Emits small XML trie as an inline Uint16Array literal.
src/generated/decode-data-html.ts Switches HTML trie data to the new packed string + decodeTrieDict loader.
src/decode.ts Adds non-streaming decodeWithTrie, numeric fast path helper, and XML switch decoder; updates trie traversal to new pointer scheme.
src/decode-codepoint.ts Replaces the C1 replacement Map with a flat array lookup.
scripts/write-decode-map.ts Implements base-91 dictionary+delta trie data encoding and new codegen paths.
scripts/trie/trie.ts Removes explicit ; child nodes and encodes semicolon requirement via flags.
scripts/trie/encode-trie.ts Changes branch pointers to relative offsets (and jump-table slots to relative+1 with 0 sentinel).
scripts/trie/encode-trie.spec.ts Updates encoding tests to match the new relative pointer format.
scripts/trie/decode-trie.ts Updates trie decoder to follow new relative pointer format.
scripts/trie/decode-trie.spec.ts Updates expected merged maps to match new implicit-semicolon trie representation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +136 to +153
if (delta < 89) {
header += String.fromCharCode(SAFE[delta]);
} else {
const adjusted = delta - 89;
header +=
adjusted < BASE * BASE
? String.fromCharCode(
ESCAPE,
SAFE[Math.floor(adjusted / BASE)],
SAFE[adjusted % BASE],
)
: String.fromCharCode(
ESCAPE,
ESCAPE,
SAFE[Math.floor(adjusted / (BASE * BASE))],
SAFE[Math.floor(adjusted / BASE) % BASE],
SAFE[adjusted % BASE],
);
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deltaEncode() can emit the escape marker followed by SAFE[90] when adjusted is in [90*BASE, BASE*BASE) (e.g. 8190–8280). That produces an ESCAPE + ESCAPE + … prefix, which is indistinguishable from the intended double-escape encoding and will be decoded incorrectly by decodeTrieDict.

To fix this, ensure the 2-char escape form never uses SAFE[90] as its first digit (e.g. only use the 2-char form when Math.floor(adjusted / BASE) < 90, and fall back to the double-escape form otherwise), and update the comment range accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +30
* Format: [dict1: D×3 chars][dict2: delta var-len][data]
*
* - dict1: the D most-frequent values, each as 3 base-91 chars.
* - dict2: all remaining unique values, delta-encoded with variable-length base-91.
* Deltas < 90 → 1 char; 90–8370 → escape + 2 chars; larger → double-escape + 3 chars.
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment for encodeTrieData() says dict1: D×3 chars / “each as 3 base-91 chars”, but dict1 is currently delta+RLE variable-length encoded via deltaEncode(dict1) (no fixed 3-char encoding). This mismatch makes the on-disk format harder to reason about and risks future encoder/decoder drift.

Please update the comment to reflect the actual header format (dict1 + dict2 both delta/RLE var-len) and the real escape thresholds.

Suggested change
* Format: [dict1: D×3 chars][dict2: delta var-len][data]
*
* - dict1: the D most-frequent values, each as 3 base-91 chars.
* - dict2: all remaining unique values, delta-encoded with variable-length base-91.
* Deltas < 90 1 char; 90–8370 escape + 2 chars; larger double-escape + 3 chars.
* Format: [dict1: delta/RLE var-len][dict2: delta/RLE var-len][data]
*
* - dict1: base values for the D most-frequent entries, encoded with delta+RLE
* variable-length base-91.
* - dict2: all remaining unique values, encoded with the same delta+RLE
* variable-length base-91 scheme as dict1 (including its escape thresholds).

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +74
// Escape: next char(s) encode a larger delta
pos += 1;
const next = lookup[input.charCodeAt(pos)];
if (next < 90) {
previous +=
89 + next * base + lookup[input.charCodeAt(pos + 1)];
pos += 2;
} else {
// Double escape
pos += 1;
previous +=
89 +
lookup[input.charCodeAt(pos)] * 8281 +
lookup[input.charCodeAt(pos + 1)] * base +
lookup[input.charCodeAt(pos + 2)];
pos += 3;
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decodeDelta() treats ESCAPE as either a 2-char payload when next < 90 or a double-escape when next >= 90. But the encoder currently allows the first base-91 digit after ESCAPE to be SAFE[90], which will force the decoder down the double-escape branch and corrupt the decoded deltas.

Either constrain the encoder so that SAFE[90] never appears as the first digit in the 2-char escape form, or adjust the decoding scheme so the single-escape form can unambiguously represent that value.

Copilot uses AI. Check for mistakes.
src/decode.ts Outdated
Comment on lines +649 to +656
return NO_MATCH;
}

digits += 1;
offset += 1;
}

return NO_MATCH;
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decodeTrieNumeric() only returns a match when it encounters a semicolon. Previously, numeric character references were decoded in Legacy/Attribute modes even without a semicolon (only Strict mode rejected them), matching the existing EntityDecoder.emitNumericEntity() behavior.

This changes decodeHTML() / decodeHTMLAttribute() semantics (and conflicts with the PR description of no breaking changes). Consider mirroring the old behavior by allowing numeric entities to terminate on the first non-digit (or non-hex-digit) when strict === false, returning the consumed length without consuming the terminator character, and adding/adjusting tests (e.g. decodeHTML("&#x3a") === ":", decodeHTMLAttribute("&#65x") behavior).

Suggested change
return NO_MATCH;
}
digits += 1;
offset += 1;
}
return NO_MATCH;
// Non-digit / non-hex character terminates the numeric reference.
if (digits === 0) return NO_MATCH;
return [
offset - numberStart,
String.fromCodePoint(replaceCodePoint(cp)),
];
}
digits += 1;
offset += 1;
}
// Input ended; if we have at least one digit, accept the reference.
if (digits === 0) return NO_MATCH;
return [
offset - numberStart,
String.fromCodePoint(replaceCodePoint(cp)),
];

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +160
/*
* Jump table: offset=48, length=10 (covers '0'..'9').
*
* [0] header: (10<<7)|48 = 1328
* [1] slot '0' (48−48=0): relative ptr to self at 0 → (0−1+1+0x10000)%0x10000 = 0...
* Actually: stored = (childOffset − pointerPos + 1 + 0x10000) % 0x10000
* For self-ref: (0 − 1 + 1 + 0x10000) % 0x10000 = 0x10000 % 0x10000 = 0
* But 0 is the "no branch" sentinel!
*
* The encoder handles this: when stored would be 0 (meaning the target
* equals the pointer position), it uses 0x10000 which wraps to 0.
* However, the decoder treats 0 as "no branch". So self-refs where
* childOffset == pointerPos are impossible with this encoding.
*
* Let's just verify structural properties.
*/
const result = encodeTrie(jumpRecursiveTrie);

expect(result).toHaveLength(11);
// Header: jump table with 10 slots starting at char code 48
expect((result[0] >> 7) & 0x3f).toBe(10); // Branch count = 10
expect(result[0] & 0x7f).toBe(48); // Jump offset = '0'

/*
* Slots at indices 1..10 for chars 48..57.
* Chars 50,51,53,55 (='2','3','5','7') have no branch → slot = 0.
*/
const slotFor = (char: number) => result[1 + (char - 48)];
expect(slotFor(50)).toBe(0); // '2' → no branch
expect(slotFor(51)).toBe(0); // '3' → no branch
expect(slotFor(53)).toBe(0); // '5' → no branch
expect(slotFor(55)).toBe(0); // '7' → no branch

/*
* Chars with branches all point back to self (index 0).
* resolved = (pointerPos + stored - 1) & 0xFFFF should equal 0.
*/
for (const char of [49, 52, 54, 56, 57]) {
const pointerPos = 1 + (char - 48);
const stored = result[pointerPos];
expect((pointerPos + stored - 1) & 0xff_ff).toBe(0);
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated test for the jump-table recursive case explicitly notes that the current encoding can produce a stored offset of 0, which collides with the “no branch” sentinel, and then it avoids asserting correctness for the '0' branch. As written, this test will pass even if the encoder silently drops a branch, and the explanatory comment claims the encoder “handles this” even though it cannot represent that pointer.

Either adjust the encoding to avoid generating 0 for valid branches (e.g. pick a different sentinel or remap stored === 0), or rewrite the test case to avoid an unrepresentable self-reference and assert the full expected behavior.

Copilot uses AI. Check for mistakes.
…e, fix delta encode boundary

- decodeTrieNumeric now accepts a `strict` parameter; in non-strict mode,
  numeric entities without a trailing semicolon are decoded (matching the
  old EntityDecoder behavior). Fixes decodeHTML("&#65x") → "Ax".
- Fix delta encode boundary: use `90 * BASE` instead of `BASE * BASE` to
  prevent escape marker collision when the first digit after ESCAPE equals
  ESCAPE itself.
- Update stale doc comment in write-decode-map.ts.
@fb55
Copy link
Copy Markdown
Owner Author

fb55 commented Mar 30, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 30, 2026

📝 Walkthrough

Walkthrough

Refactors trie encoder/decoder pointer math to position-relative, introduces a packed dictionary+delta/RLE serialization with a runtime decoder, consolidates named-entity decoding into shared trie helpers, changes C1 codepoint replacement to an array, and adjusts tests to assert structural properties and streaming decoding behavior.

Changes

Cohort / File(s) Summary
Trie Decoder (runtime)
scripts/trie/decode-trie.ts, src/internal/decode-shared.ts, src/internal/bin-trie-flags.ts, src/decode.ts
Reworked trie decode internals: pointer math now uses position-relative, modulo-wrapped offsets; FLAG13/single-char value handling tightened; packed-key/jump-table traversal and dictionary lookups adjusted; added decodeTrieDict and replaced decodeBase64; changed BinTrieFlags to const enum; unified trie helpers (readTrieValue, decodeTrieNumeric, decodeWithTrie) and removed old trie factory.
Trie Encoder (build)
scripts/trie/encode-trie.ts, scripts/write-decode-map.ts, scripts/trie/trie.ts
Encoder changed to emit pointer-position-relative offsets (with wrapping) for jump-table and linear branches; removed special-case branch optimization; writer now emits a dictionary+delta/RLE base-91 string format, supports inline-literal small-tries, and adjusted encoding parameters; trie model no longer creates explicit ';' child nodes for legacy entries (uses semiRequired).
Tests — Trie
scripts/trie/decode-trie.spec.ts, scripts/trie/encode-trie.spec.ts
mergeMaps updated to conditionally merge legacy keys; encode tests now assert structural properties (length, headers, packed keys, pointer resolution, jump-table slots) instead of exact byte arrays; decode tests reflect decoder behavior changes.
Decoder — Entities & Streaming
src/decode.spec.ts, src/decode-codepoint.ts, src/decode.ts
Added parameterized tests running multiple implementations including a streaming EntityDecoder harness; extended tests for semicolon-less numeric/name cases and attribute-mode behavior; replaced Map-based C1 replacements with a fixed c1 array and treated codePoint === 0 as invalid; decodeXML reimplemented explicitly (no xml trie import).
Misc / Tests & Formatting
(various test files)
Type/format tweaks in tests (e.g., callback typing); no public API signatures changed.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped through offsets, wrapped and neat,
I packed dictionaries into a tiny treat,
I nudged semicolons to live in flags,
I streamed entities in tidy waggs,
a rabbit's nibble made the trie complete!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 65.52% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: adding a non-streaming decoder, optimizing the trie with a new string-based format, and inlining hot paths for performance improvements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch decode-perf

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/trie/encode-trie.spec.ts`:
- Around line 120-160: The test currently misses asserting the populated '0'
edge so it can pass even if that edge is dropped; update the spec to assert
every populated jump-table slot (including char code 48) instead of only
[49,52,54,56,57]. Specifically, after calling encodeTrie(jumpRecursiveTrie) use
the existing slotFor helper (and/or the loop that computes pointerPos/stored) to
add an assertion for char 48 that verifies (pointerPos + stored - 1) & 0xffff
=== 0 (same resolution check used for other branches). This ensures the test
will fail if the encoder or determineBranch drops the '0' edge or produces the
sentinel-collision layout.

In `@scripts/write-decode-map.ts`:
- Around line 63-85: Ensure the generator validates dictionary-size bounds
before emitting encoded data: check freq.size against DICT_SIZE and DICT_SIZE +
(BASE - DICT_SIZE) * BASE and fail fast or choose a safe fallback if out of
range. Specifically, in the code that builds dict1 and dict2Sorted (using
DICT_SIZE, BASE, freq, valueToCode, dict1, dict1Set, dict2Sorted) add a guard
that throws or returns an error when freq.size < DICT_SIZE or freq.size >
DICT_SIZE + (BASE - DICT_SIZE) * BASE so you never produce a table that
decode-shared.ts (which always reads DICT_SIZE entries for dict1 and expects
dict2 to fit in the two-char space) will be unable to map values (valueToCode
returning undefined).

In `@src/decode.ts`:
- Around line 868-881: The attribute end-char rejection should only run for
unterminated named references: change the combined condition so the
isEntityInAttributeInvalidEnd check is only applied when the reference is a
named reference (not numeric — e.g. input.charCodeAt(entityStart + 1) !== 35 for
'#') and when no semicolon was consumed (i.e. the matched text did not end with
';' — check the char at entityStart + consumed - 1 !== 59). In other words, in
the block around consumed, attribute, entityStart and
isEntityInAttributeInvalidEnd, add booleans like isNamedReference and
semicolonConsumed and only evaluate isEntityInAttributeInvalidEnd when attribute
&& isNamedReference && !semicolonConsumed && entityStart + consumed <
inputLength.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1fa5bd46-22dc-4c09-a986-8b3e9c001055

📥 Commits

Reviewing files that changed from the base of the PR and between 1dad19d and c0b60a4.

⛔ Files ignored due to path filters (2)
  • src/generated/decode-data-html.ts is excluded by !**/generated/**
  • src/generated/decode-data-xml.ts is excluded by !**/generated/**
📒 Files selected for processing (11)
  • scripts/trie/decode-trie.spec.ts
  • scripts/trie/decode-trie.ts
  • scripts/trie/encode-trie.spec.ts
  • scripts/trie/encode-trie.ts
  • scripts/trie/trie.ts
  • scripts/write-decode-map.ts
  • src/decode-codepoint.ts
  • src/decode.spec.ts
  • src/decode.ts
  • src/internal/bin-trie-flags.ts
  • src/internal/decode-shared.ts

Comment on lines +120 to +160
/*
* Jump table: offset=48, length=10 (covers '0'..'9').
*
* [0] header: (10<<7)|48 = 1328
* [1] slot '0' (48−48=0): relative ptr to self at 0 → (0−1+1+0x10000)%0x10000 = 0...
* Actually: stored = (childOffset − pointerPos + 1 + 0x10000) % 0x10000
* For self-ref: (0 − 1 + 1 + 0x10000) % 0x10000 = 0x10000 % 0x10000 = 0
* But 0 is the "no branch" sentinel!
*
* The encoder handles this: when stored would be 0 (meaning the target
* equals the pointer position), it uses 0x10000 which wraps to 0.
* However, the decoder treats 0 as "no branch". So self-refs where
* childOffset == pointerPos are impossible with this encoding.
*
* Let's just verify structural properties.
*/
const result = encodeTrie(jumpRecursiveTrie);

expect(result).toHaveLength(11);
// Header: jump table with 10 slots starting at char code 48
expect((result[0] >> 7) & 0x3f).toBe(10); // Branch count = 10
expect(result[0] & 0x7f).toBe(48); // Jump offset = '0'

/*
* Slots at indices 1..10 for chars 48..57.
* Chars 50,51,53,55 (='2','3','5','7') have no branch → slot = 0.
*/
const slotFor = (char: number) => result[1 + (char - 48)];
expect(slotFor(50)).toBe(0); // '2' → no branch
expect(slotFor(51)).toBe(0); // '3' → no branch
expect(slotFor(53)).toBe(0); // '5' → no branch
expect(slotFor(55)).toBe(0); // '7' → no branch

/*
* Chars with branches all point back to self (index 0).
* resolved = (pointerPos + stored - 1) & 0xFFFF should equal 0.
*/
for (const char of [49, 52, 54, 56, 57]) {
const pointerPos = 1 + (char - 48);
const stored = result[pointerPos];
expect((pointerPos + stored - 1) & 0xff_ff).toBe(0);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't let this spec bless a missing '0' edge.

This fixture creates a branch for 48, but in a jump table that entry needs the encoded slot to collapse to 0 to resolve back to index 0, and determineBranch() treats 0 as "no branch". Because the assertions only validate '1', '4', '6', '8', and '9', this test still passes if the '0' edge is dropped at runtime. Please either make the encoder avoid jump-table layout for that sentinel collision, or change the fixture so every populated slot is actually asserted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/trie/encode-trie.spec.ts` around lines 120 - 160, The test currently
misses asserting the populated '0' edge so it can pass even if that edge is
dropped; update the spec to assert every populated jump-table slot (including
char code 48) instead of only [49,52,54,56,57]. Specifically, after calling
encodeTrie(jumpRecursiveTrie) use the existing slotFor helper (and/or the loop
that computes pointerPos/stored) to add an assertion for char 48 that verifies
(pointerPos + stored - 1) & 0xffff === 0 (same resolution check used for other
branches). This ensures the test will fail if the encoder or determineBranch
drops the '0' edge or produces the sentinel-collision layout.

Comment on lines +63 to +85
// Count frequencies
const freq = new Map<number, number>();
for (const value of data) freq.set(value, (freq.get(value) ?? 0) + 1);
// @ts-expect-error `toSorted` requires a lib bump.
const sorted: [number, number][] = [...freq.entries()].toSorted(
(a: [number, number], b: [number, number]) => b[1] - a[1],
);

// Dict1: top D values → 1-char codes, sorted ascending for delta encoding
const dict1 = sorted
.slice(0, DICT_SIZE)
.map(([value]) => value)
// eslint-disable-next-line unicorn/no-array-sort -- TS doesn't know toSorted
.sort((a: number, b: number) => a - b);
const dict1Set = new Set(dict1);

// Dict2: remaining values, sorted ascending for delta encoding
const dict2Sorted = sorted
.filter(([value]: [number, number]) => !dict1Set.has(value))
.map(([value]: [number, number]) => value)
// eslint-disable-next-line unicorn/no-array-sort -- TS doesn't know toSorted
.sort((a: number, b: number) => a - b);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate the dictionary bounds before emitting encoded data.

src/internal/decode-shared.ts always decodes exactly DICT_SIZE entries for dict1, and the two-character dict2 space only has (BASE - DICT_SIZE) * BASE slots. If freq.size < DICT_SIZE or freq.size > DICT_SIZE + (BASE - DICT_SIZE) * BASE, this generator silently emits undecodable output, and valueToCode.get(value) can start returning undefined. Please fail fast or fall back before writing a corrupt table.

Suggested guard
     // Count frequencies
     const freq = new Map<number, number>();
     for (const value of data) freq.set(value, (freq.get(value) ?? 0) + 1);
     // `@ts-expect-error` `toSorted` requires a lib bump.
     const sorted: [number, number][] = [...freq.entries()].toSorted(
         (a: [number, number], b: [number, number]) => b[1] - a[1],
     );
+    const dict2Capacity = (BASE - DICT_SIZE) * BASE;
+    if (
+        sorted.length < DICT_SIZE ||
+        sorted.length > DICT_SIZE + dict2Capacity
+    ) {
+        throw new Error(
+            `Unsupported trie cardinality: ${sorted.length} unique values`,
+        );
+    }
@@
     // Encode data
     let encodedData = "";
     for (const value of data) {
-        encodedData += valueToCode.get(value);
+        const code = valueToCode.get(value);
+        if (code == null) {
+            throw new Error(`Missing encoded trie code for value ${value}`);
+        }
+        encodedData += code;
     }

Also applies to: 162-183

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/write-decode-map.ts` around lines 63 - 85, Ensure the generator
validates dictionary-size bounds before emitting encoded data: check freq.size
against DICT_SIZE and DICT_SIZE + (BASE - DICT_SIZE) * BASE and fail fast or
choose a safe fallback if out of range. Specifically, in the code that builds
dict1 and dict2Sorted (using DICT_SIZE, BASE, freq, valueToCode, dict1,
dict1Set, dict2Sorted) add a guard that throws or returns an error when
freq.size < DICT_SIZE or freq.size > DICT_SIZE + (BASE - DICT_SIZE) * BASE so
you never produce a table that decode-shared.ts (which always reads DICT_SIZE
entries for dict1 and expects dict2 to fit in the two-char space) will be unable
to map values (valueToCode returning undefined).

The isEntityInAttributeInvalidEnd check was incorrectly applied to all
entity matches in attribute mode. Per the HTML spec (§13.2.5.73) and
the streaming EntityDecoder, this rule only applies to non-semicolon-
terminated named references.

Fixes: decodeHTMLAttribute('&amp;x') now correctly returns '&x'
       decodeHTMLAttribute('&#65;x') now correctly returns 'Ax'
       decodeHTMLAttribute('&#65x') now correctly returns 'Ax'
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/decode.spec.ts (1)

43-100: 🧹 Nitpick | 🔵 Trivial

Add direct coverage for the new XML switch.

This diff expands HTML coverage, but the rewritten decodeXML() fast path is still only hitting amp and numeric refs here. lt, gt, quot, and apos remain untested, so a typo in the new switch would slip through.

Suggested coverage
+    it.each([
+        ["&lt;", "<"],
+        ["&gt;", ">"],
+        ["&quot;", '"'],
+        ["&apos;", "'"],
+    ])("should XML decode %s", (input, output) => {
+        expect(entities.decodeXML(input)).toBe(output);
+    });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/decode.spec.ts` around lines 43 - 100, Add direct tests that exercise the
new decodeXML() fast-path cases for the four basic XML entities so typos in the
switch are caught: call entities.decodeXML(...) and assert results for "&lt;" ->
"<", "&gt;" -> ">", "&quot;" -> '"', and "&apos;" -> "'" and also add variants
with a following character (e.g. "&ltx", "&gtx", "&quotx", "&aposx") to ensure
the legacy/following-char behavior is correct; place these alongside the
existing decodeHTML/decodeHTMLAttribute tests (using the same expectation-style
as entities.decodeHTML and entities.decodeHTMLAttribute) so the new switch arms
(lt, gt, quot, apos) are directly covered.
src/decode.ts (1)

308-345: ⚠️ Potential issue | 🟠 Major

Reject legacy fallback from inside compact runs in attribute mode.

Lines 395-403 already block legacy fallback once attribute parsing has consumed continuation text, but these compact-run mismatch exits skip that guard. src/decode.spec.ts Lines 63-65 show times has a longer timesbar; descendant; in DecodingMode.Attribute, &timesbX should stay literal because the first char after the matched times is b, but this path can still emit ×.

Possible fix
                     if (input.charCodeAt(offset) !== firstChar) {
+                        const rejectAttributeFallback =
+                            this.decodeMode === DecodingMode.Attribute &&
+                            this.result !== 0 &&
+                            this.excess > 0;
                         return this.result === 0
-                            ? 0
+                            || rejectAttributeFallback
+                            ? 0
                             : this.emitNotTerminatedNamedEntity();
                     }
@@
                     if (input.charCodeAt(offset) !== expectedChar) {
+                        const rejectAttributeFallback =
+                            this.decodeMode === DecodingMode.Attribute &&
+                            this.result !== 0 &&
+                            this.excess > 0;
                         this.runConsumed = 0;
                         return this.result === 0
-                            ? 0
+                            || rejectAttributeFallback
+                            ? 0
                             : this.emitNotTerminatedNamedEntity();
                     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/decode.ts` around lines 308 - 345, The compact-run mismatch branches
(both the initial firstChar check and the packed-word expectedChar check)
currently call the legacy fallback path (return 0 or
emitNotTerminatedNamedEntity) even when attribute parsing has already consumed
continuation text; add the same guard used elsewhere to reject legacy fallback
in attribute mode: detect DecodingMode.Attribute with the continuation/consumed
state (e.g., this.mode === DecodingMode.Attribute && this.excess > 0 or the same
condition used in the existing guard around lines ~395-403) and in that case
reset this.runConsumed and return 0 (treat as literal) instead of falling
through to emitNotTerminatedNamedEntity; apply this check in both mismatch
locations (inside the firstChar mismatch branch and inside the expectedChar
mismatch branch) so compact runs honor the attribute-mode fallback prohibition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/decode.ts`:
- Around line 373-393: The streaming path currently falls through and calls
determineBranch even when a node contains a packed inline value (valueLength ===
1), causing reinterpretation of value bits as branch metadata; update the logic
in decodeWithTrie (the block handling CharCodes.SEMI and valueLength) to stop
traversal when valueLength === 1 by immediately returning
this.emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) (same behavior as the non-streaming path) before calling
determineBranch so determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) is only invoked for non-leaf nodes. Ensure the guard
uses the existing symbols valueLength, treeIndex, decodeTree, current, consumed
and excess.

---

Outside diff comments:
In `@src/decode.spec.ts`:
- Around line 43-100: Add direct tests that exercise the new decodeXML()
fast-path cases for the four basic XML entities so typos in the switch are
caught: call entities.decodeXML(...) and assert results for "&lt;" -> "<",
"&gt;" -> ">", "&quot;" -> '"', and "&apos;" -> "'" and also add variants with a
following character (e.g. "&ltx", "&gtx", "&quotx", "&aposx") to ensure the
legacy/following-char behavior is correct; place these alongside the existing
decodeHTML/decodeHTMLAttribute tests (using the same expectation-style as
entities.decodeHTML and entities.decodeHTMLAttribute) so the new switch arms
(lt, gt, quot, apos) are directly covered.

In `@src/decode.ts`:
- Around line 308-345: The compact-run mismatch branches (both the initial
firstChar check and the packed-word expectedChar check) currently call the
legacy fallback path (return 0 or emitNotTerminatedNamedEntity) even when
attribute parsing has already consumed continuation text; add the same guard
used elsewhere to reject legacy fallback in attribute mode: detect
DecodingMode.Attribute with the continuation/consumed state (e.g., this.mode ===
DecodingMode.Attribute && this.excess > 0 or the same condition used in the
existing guard around lines ~395-403) and in that case reset this.runConsumed
and return 0 (treat as literal) instead of falling through to
emitNotTerminatedNamedEntity; apply this check in both mismatch locations
(inside the firstChar mismatch branch and inside the expectedChar mismatch
branch) so compact runs honor the attribute-mode fallback prohibition.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6282019e-d931-4add-8387-579a5fc183b8

📥 Commits

Reviewing files that changed from the base of the PR and between c0b60a4 and a402eb4.

📒 Files selected for processing (2)
  • src/decode.spec.ts
  • src/decode.ts

Comment on lines 373 to 393
/*
* Implicit semicolon handling for nodes that require a semicolon but
* don't have an explicit ';' branch stored in the trie. If we have
* a value on the current node, it requires a semicolon, and the
* current input character is a semicolon, emit the entity using the
* current node (without descending further).
* Implicit semicolon handling: if the current node has a value and the
* input character is `;`, emit immediately. This covers both strict
* entities (FLAG13 set) and legacy entities (FLAG13 clear) — neither
* stores an explicit `;` branch in the trie.
*/
if (
char === CharCodes.SEMI &&
valueLength !== 0 &&
(current & BinTrieFlags.FLAG13) !== 0
) {
if (char === CharCodes.SEMI && valueLength !== 0) {
return this.emitNamedEntityData(
this.treeIndex,
valueLength,
this.consumed + this.excess,
);
}

// Navigate to the next node (valueLength || 1: skip past value words, minimum 1 for header).
this.treeIndex = determineBranch(
decodeTree,
current,
this.treeIndex + Math.max(1, valueLength),
this.treeIndex + (valueLength || 1),
char,
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Stop traversing once a packed inline value is reached.

decodeWithTrie() Lines 815-820 already treats valueLength === 1 nodes as leaf-only. The streaming path still falls through to determineBranch(), which reinterprets the packed code point bits as branch metadata. In attribute mode that can lose valid legacy matches: &amp& should still emit & because & is not an invalid attribute terminator, but after the bogus walk end() can return 0 because result !== treeIndex.

Suggested guard
             if (char === CharCodes.SEMI && valueLength !== 0) {
                 return this.emitNamedEntityData(
                     this.treeIndex,
                     valueLength,
                     this.consumed + this.excess,
                 );
             }
+
+            if (valueLength === 1) {
+                return this.result === 0 ||
+                    (this.decodeMode === DecodingMode.Attribute &&
+                        isEntityInAttributeInvalidEnd(char))
+                    ? 0
+                    : this.emitNotTerminatedNamedEntity();
+            }
 
             // Navigate to the next node (valueLength || 1: skip past value words, minimum 1 for header).
             this.treeIndex = determineBranch(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/decode.ts` around lines 373 - 393, The streaming path currently falls
through and calls determineBranch even when a node contains a packed inline
value (valueLength === 1), causing reinterpretation of value bits as branch
metadata; update the logic in decodeWithTrie (the block handling CharCodes.SEMI
and valueLength) to stop traversal when valueLength === 1 by immediately
returning this.emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) (same behavior as the non-streaming path) before calling
determineBranch so determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) is only invoked for non-leaf nodes. Ensure the guard
uses the existing symbols valueLength, treeIndex, decodeTree, current, consumed
and excess.

Runs all decode tests against three implementations:
- sync: the new decodeWithTrie-based functions
- streaming (all at once): EntityDecoder wrapper, full string per write()
- streaming (char-by-char): EntityDecoder wrapper, one char per write()

This ensures the new non-streaming decoder matches EntityDecoder behavior
for all edge cases (legacy entities, attribute mode, numeric entities
without semicolons, etc.). Test count: 125 → 211.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/decode.ts (1)

315-320: ⚠️ Potential issue | 🟠 Major

Only fall back to a stored legacy match while the stream is still on that node.

These new this.result snapshots let the streaming decoder keep walking after a legacy match, but the compact-run mismatch exits and the branch-miss fallback can still call emitNotTerminatedNamedEntity() after this.treeIndex has moved on. In attribute mode that is too permissive: once this.result !== this.treeIndex, the first character after the legacy match was already part of a longer ASCII name, so the reference must be rejected. &notin, is a concrete divergence here — decodeHTMLAttribute() leaves it untouched, while EntityDecoder can still fall back to &not. Reuse the same this.result === this.treeIndex invariant that end() already applies before any non-semicolon fallback.

Also applies to: 341-345, 357-366, 395-404, 410-425

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/decode.ts` around lines 315 - 320, The fallback that calls
emitNotTerminatedNamedEntity() must only be used while the stream is still on
the same trie node; change the mismatch branches that currently do "return
this.result === 0 ? 0 : this.emitNotTerminatedNamedEntity()" to only emit when
this.result === this.treeIndex (i.e. if this.result === 0 or this.result !==
this.treeIndex then return 0, otherwise call emitNotTerminatedNamedEntity()).
Apply this same change to all the analogous spots mentioned (the branches around
runConsumed, where runConsumed === 0 and the other mismatches at the ranges
noted) so the code checks the invariant this.result === this.treeIndex before
falling back to the legacy match.
♻️ Duplicate comments (1)
src/decode.ts (1)

379-393: ⚠️ Potential issue | 🔴 Critical

Stop at inline-value leaf nodes before calling determineBranch().

valueLength === 1 means the decoded code point is packed into the header word itself, so there is no branch metadata left to read. Falling through here still reinterprets that payload as BRANCH_LENGTH/JUMP_TABLE, which can send the streaming decoder into unrelated nodes and break cases like &amp&. This was flagged earlier and the guard is still missing. Please add a regression case for that input once fixed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/decode.ts` around lines 379 - 393, The decoder must stop at inline-value
leaf nodes (valueLength === 1) before calling determineBranch; modify the logic
around CharCodes.SEMI handling so that when valueLength !== 0 you call
emitNamedEntityData (as already done) and additionally when valueLength === 1
you also return emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) to avoid falling through into determineBranch; update the block
that sets this.treeIndex = determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) to only run when valueLength !== 1, and add a
regression test for the input "&amp&" to ensure the decoder no longer walks into
unrelated nodes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/decode.ts`:
- Around line 315-320: The fallback that calls emitNotTerminatedNamedEntity()
must only be used while the stream is still on the same trie node; change the
mismatch branches that currently do "return this.result === 0 ? 0 :
this.emitNotTerminatedNamedEntity()" to only emit when this.result ===
this.treeIndex (i.e. if this.result === 0 or this.result !== this.treeIndex then
return 0, otherwise call emitNotTerminatedNamedEntity()). Apply this same change
to all the analogous spots mentioned (the branches around runConsumed, where
runConsumed === 0 and the other mismatches at the ranges noted) so the code
checks the invariant this.result === this.treeIndex before falling back to the
legacy match.

---

Duplicate comments:
In `@src/decode.ts`:
- Around line 379-393: The decoder must stop at inline-value leaf nodes
(valueLength === 1) before calling determineBranch; modify the logic around
CharCodes.SEMI handling so that when valueLength !== 0 you call
emitNamedEntityData (as already done) and additionally when valueLength === 1
you also return emitNamedEntityData(this.treeIndex, valueLength, this.consumed +
this.excess) to avoid falling through into determineBranch; update the block
that sets this.treeIndex = determineBranch(decodeTree, current, this.treeIndex +
(valueLength || 1), char) to only run when valueLength !== 1, and add a
regression test for the input "&amp&" to ensure the decoder no longer walks into
unrelated nodes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a5da75bd-2383-4873-9758-f1fba702346c

📥 Commits

Reviewing files that changed from the base of the PR and between a402eb4 and e76f13e.

📒 Files selected for processing (2)
  • src/decode.spec.ts
  • src/decode.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants