refactor(encode): optimize trie, pick shortest entities, use base-10 numerics#2198
refactor(encode): optimize trie, pick shortest entities, use base-10 numerics#2198
Conversation
There was a problem hiding this comment.
Pull request overview
Refactors HTML/XML encoding to use a more compact generated encode trie representation, adds an ASCII fast-path, and switches numeric entity output from hex (&#x...;) to decimal (&#...;) across encoding/escaping APIs.
Changes:
- Replaces generated
htmlTrie: Mapwith a default-exported serialized string parsed at startup, plus anasciiEntities[]lookup table for faster ASCII encoding. - Changes numeric entity fallback output from hexadecimal to decimal in
encodeHTML/encodeNonAsciiHTMLandencodeXML(and updates affected tests). - Removes
src/internal/encode-shared.tsby inlining the trie parsing logic intosrc/encode.tsand updating the generator script.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/internal/encode-shared.ts | Deleted; trie parsing logic moved into src/encode.ts. |
| src/encode.ts | Adds runtime parsing of serialized trie + ASCII fast-path; switches numeric refs to decimal. |
| src/generated/encode-html.ts | Now exports a compact serialized string instead of a parsed Map. |
| scripts/write-encode-map.ts | Updates generator to serialize in new format; picks shortest entity name; strips ASCII children. |
| src/escape.ts | Switches encodeXML numeric output to decimal. |
| src/index.spec.ts | Updates expectations for decimal numeric entities. |
| src/encode.spec.ts | Updates expectations for decimal numeric entities. |
| src/decode.spec.ts | Formatting-only change (type annotation wrapping). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const entityValue = | ||
| data.charCodeAt(cursor) === 123 /* '{' */ ? null : readEntity(); | ||
|
|
||
| if (data.charCodeAt(cursor) === 123 /* '{' */) { | ||
| cursor++; // Skip '{' |
There was a problem hiding this comment.
The parser only writes to asciiEntities[...] when an ASCII entry has no children. If you ever decide to keep ASCII starter children in the serialized trie (instead of stripping them), the encoder would still go through the ASCII fast-path and would miss those values unless asciiEntities is populated for branch nodes too. Consider either enforcing "no ASCII children" as an invariant (and documenting it), or always populating asciiEntities when lastKey < 0x80 and an entityValue is present.
…dd regression tests Keep ASCII trie children (e.g. < + U+20D2 → <⃒, = + U+20E5 → =⃥, > + U+20D2 → >⃒) in the serialized data. The encoder now checks the trie for multi-char matches before falling back to the ASCII fast-path.
- Restore JSDoc comments on TrieNode interface fields (write-encode-map.ts) - Rename `ai` → `astralIndex` for clarity - Restructure encoder: ASCII fast-path first, trie lookup only for multi-code-point children fallback - Remove unnecessary bare block wrapper
📝 WalkthroughWalkthroughRuntime now builds the HTML encode trie from a serialized JSON string instead of importing a pre-parsed trie; numeric character references changed from hexadecimal to decimal across encoding/escaping; Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/encode.ts (1)
158-159:⚠️ Potential issue | 🟡 MinorJSDoc still references hexadecimal format.
The JSDoc comments for
encodeHTML(line 158-159) andencodeNonAsciiHTML(line 170-171) still say "numeric hexadecimal reference (eg.ü)" but the implementation now emits decimal references.📝 Proposed fix to update JSDoc
/** * Encodes all characters in the input using HTML entities. This includes * characters that are valid ASCII characters in HTML documents, such as `#`. * * To get a more compact output, consider using the `encodeNonAsciiHTML` * function, which will only encode characters that are not valid in HTML * documents, as well as non-ASCII characters. * - * If a character has no equivalent entity, a numeric hexadecimal reference - * (eg. `ü`) will be used. + * If a character has no equivalent entity, a numeric decimal reference + * (eg. `ü`) will be used. * `@param` input Input string to encode or decode. */ export function encodeHTML(input: string): string {/** * Encodes all non-ASCII characters, as well as characters not valid in HTML * documents using HTML entities. This function will not encode characters that * are valid in HTML documents, such as `#`. * - * If a character has no equivalent entity, a numeric hexadecimal reference - * (eg. `ü`) will be used. + * If a character has no equivalent entity, a numeric decimal reference + * (eg. `ü`) will be used. * `@param` input Input string to encode or decode. */ export function encodeNonAsciiHTML(input: string): string {Also applies to: 170-171
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/encode.ts` around lines 158 - 159, Update the JSDoc for encodeHTML and encodeNonAsciiHTML to reflect that the implementation emits numeric decimal references (e.g. ü or ü in decimal form) instead of hexadecimal; locate the JSDoc blocks above the encodeHTML and encodeNonAsciiHTML function declarations and change the phrase "numeric hexadecimal reference (eg. `ü`)" to something like "numeric decimal reference (eg. `ü`) or simply "numeric character reference (decimal)". Ensure both encodeHTML and encodeNonAsciiHTML comments are updated consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/encode.ts`:
- Around line 158-159: Update the JSDoc for encodeHTML and encodeNonAsciiHTML to
reflect that the implementation emits numeric decimal references (e.g. ü or
ü in decimal form) instead of hexadecimal; locate the JSDoc blocks above
the encodeHTML and encodeNonAsciiHTML function declarations and change the
phrase "numeric hexadecimal reference (eg. `ü`)" to something like "numeric
decimal reference (eg. `ü`) or simply "numeric character reference
(decimal)". Ensure both encodeHTML and encodeNonAsciiHTML comments are updated
consistently.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9108289-fe74-464a-b161-e7ee09bd6288
⛔ Files ignored due to path filters (1)
src/generated/encode-html.tsis excluded by!**/generated/**
📒 Files selected for processing (7)
scripts/write-encode-map.tssrc/decode.spec.tssrc/encode.spec.tssrc/encode.tssrc/escape.tssrc/index.spec.tssrc/internal/encode-shared.ts
💤 Files with no reviewable changes (1)
- src/internal/encode-shared.ts
The implementation was changed from hex (ü) to decimal (ü) but the JSDoc comments still referenced hexadecimal.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/encode.ts`:
- Line 245: Replace the deprecated call to input.substr with input.substring to
avoid relying on Annex B legacy features; locate the expression using
input.substr(last) (the variables input, last, length, out are nearby) and
change it so the remaining tail of the string is appended using
input.substring(last) while preserving the existing behavior and assignment to
out.
- Around line 88-91: The child-parsing loop in readEntity (uses readGap and
childKey/next) assumes children are flat (only leaf entities) and doesn't
recurse into nested `{...}` blocks; update the code by either (A) adding a clear
comment at the readEntity loop documenting the bounded-depth assumption (that
all entities are 2 code-points and trie depth ≤2), and (B) add a validation in
scripts/write-encode-map.ts that the generated encode trie never exceeds depth 2
(fail the build or throw a clear error if deeper nodes are produced) so the
assumption is enforced upstream. Ensure references to readEntity, readGap,
childKey, next are mentioned in the comment and the validation checks the trie
depth when constructing/writing the map.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| while (data.charCodeAt(cursor) !== 125 /* '}' */) { | ||
| childKey += readGap() + 1; | ||
| next.set(childKey, readEntity()); | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Child parsing assumes flat structure (depth-1 children only).
The child parsing loop reads only leaf entities and doesn't recursively parse nested {...} blocks. This works correctly for current HTML5 entities (all multi-code-point entities have exactly 2 code points), but if a 3+ code-point entity were added in the future, the parser would malfunction.
Consider adding a brief comment to document this assumption, or verify upstream (in scripts/write-encode-map.ts) that the trie depth is bounded.
📝 Suggested documentation
while (data.charCodeAt(cursor) !== 125 /* '}' */) {
childKey += readGap() + 1;
+ // NB: Children are assumed to be leaf nodes (no nested {...} blocks).
+ // This holds for all HTML5 entities (max 2 code points per entity).
next.set(childKey, readEntity());
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/encode.ts` around lines 88 - 91, The child-parsing loop in readEntity
(uses readGap and childKey/next) assumes children are flat (only leaf entities)
and doesn't recurse into nested `{...}` blocks; update the code by either (A)
adding a clear comment at the readEntity loop documenting the bounded-depth
assumption (that all entities are 2 code-points and trie depth ≤2), and (B) add
a validation in scripts/write-encode-map.ts that the generated encode trie never
exceeds depth 2 (fail the build or throw a clear error if deeper nodes are
produced) so the assumption is enforced upstream. Ensure references to
readEntity, readGap, childKey, next are mentioned in the comment and the
validation checks the trie depth when constructing/writing the map.
| if (out === undefined) return input; | ||
| // If nothing needed encoding, return the original string (avoids allocation). | ||
| if (out == null) return input; | ||
| if (last < length) out += input.substr(last); |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Prefer substring over deprecated substr.
String.prototype.substr is specified in ECMAScript Annex B (legacy features) and is not recommended for new code. Both methods behave identically when called with a single argument.
♻️ Suggested fix
- if (last < length) out += input.substr(last);
+ if (last < length) out += input.substring(last);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (last < length) out += input.substr(last); | |
| if (last < length) out += input.substring(last); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/encode.ts` at line 245, Replace the deprecated call to input.substr with
input.substring to avoid relying on Annex B legacy features; locate the
expression using input.substr(last) (the variables input, last, length, out are
nearby) and change it so the remaining tail of the string is appended using
input.substring(last) while preserving the existing behavior and assignment to
out.
Summary
&over&)&#xHEX;to&#DEC;— shorter output for most code points (e.g.❤vs❤)htmlTrieMap fromencode-html.tswith a compact serialized string format parsed at startup into aMap<number, EncodeTrieNode>+ a flatasciiEntities[]lookup table for fast ASCII pathssrc/internal/encode-shared.ts: all encode trie logic now lives inencode.ts<⃒(<+ U+20D2),>⃒(>+ U+20D2), and=⃥(=+ U+20E5) are kept in the trie and the encoder checks for multi-char matches before the ASCII fast pathBreaking changes
❤becomes❤. Both are valid HTML/XML, but snapshot tests matching exact output will need updating.&and&map to the same code point, the shorter one is now always chosen.escapeUTF8/encodeXMLfunctions inescape.tsalso produce decimal numeric entities (&#...;instead of&#x...;).src/internal/encode-shared.tsis removed.Benchmark results
Geometric mean across 12 text variants (short/medium/long × low/high entity density × named/numeric), via html-entity-benchmarks
compare:Biggest gain on short-low-named inputs (2.94M → 7.65M, +2.6×) from the ASCII fast-path array lookup. Longer/higher-density inputs are roughly at parity.
Bundle size (esbuild --minify --tree-shaking, esm)
encodeHTML🤖 Generated with Claude Opus (Claude Code)
Summary by CodeRabbit