refactor(encode): optimize trie, pick shortest entities, use base-10 numerics#2198

Open
fb55 wants to merge 5 commits into main from encode-perf

Conversation

fb55 (Owner) commented Mar 29, 2026

Summary

  • Pick shortest entity on encode: the encode-trie generator now selects the shortest named entity when multiple aliases exist for the same code point (e.g. &amp; over &AMP;)
  • Base-10 numeric entities: switches fallback numeric encoding from &#xHEX; to &#DEC;, which is shorter for some code points and the same length for most others (e.g. &#10084; vs &#x2764;)
  • Trie overhaul: replaces the imported htmlTrie Map from encode-html.ts with a compact serialized string format parsed at startup into a Map<number, EncodeTrieNode> + a flat asciiEntities[] lookup table for fast ASCII paths
  • Deleted src/internal/encode-shared.ts: all encode trie logic now lives in encode.ts
  • Multi-code-point entities preserved: entities like &nvlt; (< + U+20D2), &nvgt; (> + U+20D2), and &bne; (= + U+20E5) are kept in the trie and the encoder checks for multi-char matches before the ASCII fast path
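
The alias selection described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: `sampleAliases`, `pickShortestAliases`, and the tie-break rule for equally long names are all assumptions made for the example.

```typescript
// Hypothetical sample of [entity name, decoded text] pairs; the real
// generator reads the full HTML entity list.
const sampleAliases: Array<[string, string]> = [
    ["AMP", "&"],
    ["amp", "&"],
    ["heartsuit", "\u2665"],
    ["hearts", "\u2665"],
    ["nvlt", "<\u20D2"],
];

/** For every decoded text, keep only the shortest entity name. */
function pickShortestAliases(
    pairs: Array<[string, string]>,
): Map<string, string> {
    const best = new Map<string, string>();
    for (const [alias, decoded] of pairs) {
        const current = best.get(decoded);
        const isBetter =
            current === undefined ||
            alias.length < current.length ||
            // Assumed tie-break: the alias that sorts later wins, which
            // favours the lowercase spelling over the uppercase one.
            (alias.length === current.length && alias > current);
        if (isBetter) best.set(decoded, alias);
    }
    return best;
}

const shortestAliases = pickShortestAliases(sampleAliases);
```

Keying the map by the decoded text rather than by a single code point lets multi-code-point entities such as &nvlt; go through the same selection step.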

Breaking changes

  • Numeric entity output changes from hex to decimal: &#x2764; becomes &#10084;. Both are valid HTML/XML, but snapshot tests matching exact output will need updating.
  • Named entity output may change when a shorter alias exists — e.g. if both &AMP; and &amp; map to the same code point, the shorter one is now always chosen.
  • The escapeUTF8 / encodeXML functions in escape.ts also produce decimal numeric entities (&#...; instead of &#x...;).
  • src/internal/encode-shared.ts is removed.
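
To make the numeric-reference change concrete, here is a small sketch of the two output forms. The helper names `toDecimalReference` and `toHexReference` are hypothetical and are not part of the library's API.

```typescript
/** Decimal numeric character reference, e.g. 0x2764 -> "&#10084;". */
function toDecimalReference(codePoint: number): string {
    return `&#${codePoint};`;
}

/** Hexadecimal numeric character reference, e.g. 0x2764 -> "&#x2764;". */
function toHexReference(codePoint: number): string {
    return `&#x${codePoint.toString(16)};`;
}

// Both forms side by side for a few sample code points.
const referenceSamples = [0x20, 0xfc, 0x2764].map((codePoint) => ({
    codePoint,
    decimal: toDecimalReference(codePoint),
    hex: toHexReference(codePoint),
}));
```

Both strings in each pair decode to the same character, so only consumers that compare the encoded text byte for byte (snapshot tests, for example) should notice the change.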

Benchmark results

Geometric mean across 12 text variants (short/medium/long × low/high entity density × named/numeric), via html-entity-benchmarks compare:

| | Geo mean (ops/s) | Δ |
| --- | --- | --- |
| entities v8 (published) | 541.5K | |
| encode-perf | 582.9K | +8% |

Biggest gain on short-low-named inputs (2.94M → 7.65M ops/s, about 2.6×) from the ASCII fast-path array lookup. Longer/higher-density inputs are roughly at parity.
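
For readers unfamiliar with the summary statistic, the sketch below shows how a geometric mean and a relative change are computed. The per-variant numbers in `hypotheticalRuns` are made up; only the two geo-mean figures passed to `percentChange` come from the table above.

```typescript
/** Geometric mean via logarithms, which avoids overflow on large products. */
function geometricMean(values: number[]): number {
    if (values.length === 0) throw new Error("need at least one value");
    let logSum = 0;
    for (const value of values) {
        if (value <= 0) throw new Error("values must be positive");
        logSum += Math.log(value);
    }
    return Math.exp(logSum / values.length);
}

/** Relative change in percent between two measurements. */
function percentChange(before: number, after: number): number {
    return (after / before - 1) * 100;
}

// Hypothetical ops/s figures for two text variants.
const hypotheticalRuns = [2, 8];
const hypotheticalMean = geometricMean(hypotheticalRuns);

// Figures from the benchmark table (in thousands of ops/s).
const overallGain = percentChange(541.5, 582.9);
```

A geometric mean keeps one very fast variant from dominating the summary, which matters here because the short-input gain is much larger than the rest.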

Bundle size (esbuild --minify --tree-shaking, esm)

| Export | Before | After | Δ |
| --- | --- | --- | --- |
| encodeHTML | 14,455 B (6,242 gz) | 11,150 B (5,328 gz) | −23% (−15% gz) |
| Full bundle | 51,718 B (28,091 gz) | 48,605 B (27,320 gz) | −6% (−3% gz) |

🤖 Generated with Claude Opus (Claude Code)

Summary by CodeRabbit

  • Refactor
    • Numeric character references now use decimal form instead of hexadecimal.
  • Bug Fixes
    • Improved handling of full Unicode (including astral/surrogate pairs) and multi-code-point sequences so emoji and combined characters encode correctly.
  • Tests
    • Updated and added tests to expect decimal references and to cover multi-code-point/edge-case sequences.

Copilot AI review requested due to automatic review settings March 29, 2026 11:28

Copilot AI left a comment

Pull request overview

Refactors HTML/XML encoding to use a more compact generated encode trie representation, adds an ASCII fast-path, and switches numeric entity output from hex (&#x...;) to decimal (&#...;) across encoding/escaping APIs.

Changes:

  • Replaces generated htmlTrie: Map with a default-exported serialized string parsed at startup, plus an asciiEntities[] lookup table for faster ASCII encoding.
  • Changes numeric entity fallback output from hexadecimal to decimal in encodeHTML/encodeNonAsciiHTML and encodeXML (and updates affected tests).
  • Removes src/internal/encode-shared.ts by inlining the trie parsing logic into src/encode.ts and updating the generator script.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/internal/encode-shared.ts | Deleted; trie parsing logic moved into src/encode.ts. |
| src/encode.ts | Adds runtime parsing of serialized trie + ASCII fast-path; switches numeric refs to decimal. |
| src/generated/encode-html.ts | Now exports a compact serialized string instead of a parsed Map. |
| scripts/write-encode-map.ts | Updates generator to serialize in new format; picks shortest entity name; strips ASCII children. |
| src/escape.ts | Switches encodeXML numeric output to decimal. |
| src/index.spec.ts | Updates expectations for decimal numeric entities. |
| src/encode.spec.ts | Updates expectations for decimal numeric entities. |
| src/decode.spec.ts | Formatting-only change (type annotation wrapping). |

Comment on lines +82 to +86
const entityValue =
data.charCodeAt(cursor) === 123 /* '{' */ ? null : readEntity();

if (data.charCodeAt(cursor) === 123 /* '{' */) {
cursor++; // Skip '{'

Copilot AI Mar 29, 2026

The parser only writes to asciiEntities[...] when an ASCII entry has no children. If you ever decide to keep ASCII starter children in the serialized trie (instead of stripping them), the encoder would still go through the ASCII fast-path and would miss those values unless asciiEntities is populated for branch nodes too. Consider either enforcing "no ASCII children" as an invariant (and documenting it), or always populating asciiEntities when lastKey < 0x80 and an entityValue is present.
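
The second option in this comment (populate the ASCII table whenever a value is present, including for branch nodes) might look roughly like the sketch below. Everything here is illustrative: `SketchTrieNode`, `sketchTrie`, `sketchAscii`, and `encodeSketch` are hypothetical names, the data is hand-written rather than generated, and the control flow is an assumption about how a fast path and a child lookup could be combined.

```typescript
interface SketchTrieNode {
    /** Entity for this code point on its own, if any. */
    value?: string;
    /** Entities for this code point followed by a second code point. */
    next?: Map<number, string>;
}

// Hand-written sample data (assumption); the real table is generated.
const sketchTrie = new Map<number, SketchTrieNode>([
    [0x26, { value: "&amp;" }],
    [0x3c, { value: "&lt;", next: new Map([[0x20d2, "&nvlt;"]]) }],
    [0x3e, { value: "&gt;", next: new Map([[0x20d2, "&nvgt;"]]) }],
]);

// Flat ASCII table, filled for every ASCII node that has a value,
// whether or not the node also has children.
const sketchAscii = new Array<string | undefined>(0x80);
sketchTrie.forEach((node, codePoint) => {
    if (codePoint < 0x80 && node.value !== undefined) {
        sketchAscii[codePoint] = node.value;
    }
});

function encodeSketch(input: string): string {
    let out = "";
    let index = 0;
    while (index < input.length) {
        const codePoint = input.codePointAt(index) as number;
        if (codePoint < 0x80) {
            const entity = sketchAscii[codePoint];
            if (entity === undefined) {
                // Fast path: plain ASCII, no Map lookup needed.
                out += input.charAt(index);
                index += 1;
                continue;
            }
            // Prefer a two-code-point match over the single-character entity.
            const combined = sketchTrie
                .get(codePoint)
                ?.next?.get(input.charCodeAt(index + 1));
            out += combined ?? entity;
            index += combined === undefined ? 1 : 2;
            continue;
        }
        // Non-ASCII: named entity if known, otherwise a decimal reference.
        out += sketchTrie.get(codePoint)?.value ?? `&#${codePoint};`;
        index += codePoint > 0xffff ? 2 : 1;
    }
    return out;
}
```

In this sketch an ASCII starter that has children but no entity of its own would still be missed, so the "no value-less ASCII branch nodes" invariant would need to be documented or enforced as well.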

fb55 added 3 commits March 29, 2026 12:40
…dd regression tests

Keep ASCII trie children (e.g. < + U+20D2 → &nvlt;, = + U+20E5 → &bne;,
> + U+20D2 → &nvgt;) in the serialized data. The encoder now checks the
trie for multi-char matches before falling back to the ASCII fast-path.
- Restore JSDoc comments on TrieNode interface fields (write-encode-map.ts)
- Rename `ai` → `astralIndex` for clarity
- Restructure encoder: ASCII fast-path first, trie lookup only for
  multi-code-point children fallback
- Remove unnecessary bare block wrapper

coderabbitai bot commented Mar 30, 2026

Walkthrough

Runtime now builds the HTML encode trie from a serialized JSON string instead of importing a pre-parsed trie; numeric character references changed from hexadecimal to decimal across encoding/escaping; encode-shared was removed; tests and the trie-generation script were updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Trie generation script: scripts/write-encode-map.ts | Now emits a JSON-serialized encode-html.ts (base-10 numeric format) instead of embedding parseEncodeTrie(...) and encode-shared references; logging updated to include data length. |
| Runtime trie parsing & encoder: src/encode.ts | Removed external trie import; added in-module EncodeTrieNode type, runtime parsing of htmlTrieData into Map<number, EncodeTrieNode>, asciiEntities table, and multi-code-point matching; switched numeric fallbacks from hex to decimal and switched to codePointAt with surrogate handling; lazy out allocation adjusted. |
| Removed shared utility: src/internal/encode-shared.ts | Deleted: previously exported EncodeTrieNode type and parseEncodeTrie(serialized) implementation removed. |
| XML escaping: src/escape.ts | Changed XML numeric references from hex (&#x...;) to decimal (&#...;); uses codePointAt for code points and preserves surrogate handling. |
| Tests (encoding/escaping expectations): src/encode.spec.ts, src/index.spec.ts | Updated assertions to expect decimal numeric references instead of hexadecimal; added tests for ASCII + combining/multi-code-point sequences; emoji and surrogate-related expectations adjusted. |
| Tests (minor formatting): src/decode.spec.ts | Type annotation formatting for a test-local callback variable changed (no runtime behavior change). |
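
The switch to codePointAt mentioned for src/encode.ts and src/escape.ts matters for characters outside the Basic Multilingual Plane. The sketch below illustrates the idea with a hypothetical `encodeXmlSketch` function and a hand-written `xmlNamed` table; it is not the library's encodeXML implementation.

```typescript
// The five predefined XML entities, keyed by code point.
const xmlNamed = new Map<number, string>([
    [34, "&quot;"],
    [38, "&amp;"],
    [39, "&apos;"],
    [60, "&lt;"],
    [62, "&gt;"],
]);

function encodeXmlSketch(input: string): string {
    let out = "";
    for (let index = 0; index < input.length; index++) {
        // codePointAt combines a surrogate pair into one code point,
        // whereas charCodeAt would return only the high surrogate.
        const codePoint = input.codePointAt(index) as number;
        // An astral code point occupies two UTF-16 units; skip the second.
        if (codePoint > 0xffff) index++;

        const named = xmlNamed.get(codePoint);
        if (named !== undefined) {
            out += named;
        } else if (codePoint < 0x80) {
            out += String.fromCharCode(codePoint);
        } else {
            // Decimal numeric character reference for everything else.
            out += `&#${codePoint};`;
        }
    }
    return out;
}
```

With charCodeAt, the emoji U+1F600 would be seen as the two code units 0xD83D and 0xDE00 and would need extra logic to be recombined; codePointAt yields 128512 directly.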

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble hex and hop to ten,
Trie unrolled at runtime again,
Shared file shelved, entities renewed,
Decimal signals in every queued.
Hooray — small hops, big changes, chew!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the three main objectives: trie optimization, selecting shortest entities, and switching to base-10 numerics. |

fb55 (Owner, Author) commented Mar 30, 2026

@coderabbitai review

coderabbitai bot commented Mar 30, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/encode.ts (1)

158-159: ⚠️ Potential issue | 🟡 Minor

JSDoc still references hexadecimal format.

The JSDoc comments for encodeHTML (line 158-159) and encodeNonAsciiHTML (line 170-171) still say "numeric hexadecimal reference (eg. &#xfc;)" but the implementation now emits decimal references.

📝 Proposed fix to update JSDoc
 /**
  * Encodes all characters in the input using HTML entities. This includes
  * characters that are valid ASCII characters in HTML documents, such as `#`.
  *
  * To get a more compact output, consider using the `encodeNonAsciiHTML`
  * function, which will only encode characters that are not valid in HTML
  * documents, as well as non-ASCII characters.
  *
- * If a character has no equivalent entity, a numeric hexadecimal reference
- * (eg. `&#xfc;`) will be used.
+ * If a character has no equivalent entity, a numeric decimal reference
+ * (eg. `&#252;`) will be used.
  * @param input Input string to encode or decode.
  */
 export function encodeHTML(input: string): string {
 /**
  * Encodes all non-ASCII characters, as well as characters not valid in HTML
  * documents using HTML entities. This function will not encode characters that
  * are valid in HTML documents, such as `#`.
  *
- * If a character has no equivalent entity, a numeric hexadecimal reference
- * (eg. `&#xfc;`) will be used.
+ * If a character has no equivalent entity, a numeric decimal reference
+ * (eg. `&#252;`) will be used.
  * @param input Input string to encode or decode.
  */
 export function encodeNonAsciiHTML(input: string): string {

Also applies to: 170-171

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/encode.ts` around lines 158 - 159, Update the JSDoc for encodeHTML and
encodeNonAsciiHTML to reflect that the implementation emits numeric decimal
references (e.g. &#252; instead of &#xfc;) rather than hexadecimal ones;
locate the JSDoc blocks above the encodeHTML and encodeNonAsciiHTML function
declarations and change the phrase "numeric hexadecimal reference (eg.
`&#xfc;`)" to something like "numeric decimal reference (eg. `&#252;`)" or
simply "numeric character reference (decimal)". Ensure both encodeHTML and
encodeNonAsciiHTML comments are updated consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9108289-fe74-464a-b161-e7ee09bd6288

📥 Commits

Reviewing files that changed from the base of the PR and between 1dad19d and 43b47f0.

⛔ Files ignored due to path filters (1)
  • src/generated/encode-html.ts is excluded by !**/generated/**
📒 Files selected for processing (7)
  • scripts/write-encode-map.ts
  • src/decode.spec.ts
  • src/encode.spec.ts
  • src/encode.ts
  • src/escape.ts
  • src/index.spec.ts
  • src/internal/encode-shared.ts
💤 Files with no reviewable changes (1)
  • src/internal/encode-shared.ts

The implementation was changed from hex (&#xfc;) to decimal (&#252;)
but the JSDoc comments still referenced hexadecimal.

coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/encode.ts`:
- Line 245: Replace the deprecated call to input.substr with input.substring to
avoid relying on Annex B legacy features; locate the expression using
input.substr(last) (the variables input, last, length, out are nearby) and
change it so the remaining tail of the string is appended using
input.substring(last) while preserving the existing behavior and assignment to
out.
- Around line 88-91: The child-parsing loop in readEntity (uses readGap and
childKey/next) assumes children are flat (only leaf entities) and doesn't
recurse into nested `{...}` blocks; update the code by (A) adding a clear
comment at the readEntity loop documenting the bounded-depth assumption (that
all entities span at most 2 code points, so trie depth ≤ 2), and (B) adding a
validation in scripts/write-encode-map.ts that the generated encode trie never
exceeds depth 2 (fail the build or throw a clear error if deeper nodes are
produced) so the assumption is enforced upstream. Ensure references to
readEntity, readGap, childKey, next are mentioned in the comment and the
validation checks the trie depth when constructing/writing the map.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2278f83-9025-4933-90f8-a71966747584

📥 Commits

Reviewing files that changed from the base of the PR and between 43b47f0 and d67be1f.

📒 Files selected for processing (1)
  • src/encode.ts

Comment on lines +88 to +91
while (data.charCodeAt(cursor) !== 125 /* '}' */) {
childKey += readGap() + 1;
next.set(childKey, readEntity());
}

🧹 Nitpick | 🔵 Trivial

Child parsing assumes flat structure (depth-1 children only).

The child parsing loop reads only leaf entities and doesn't recursively parse nested {...} blocks. This works correctly for current HTML5 entities (all multi-code-point entities have exactly 2 code points), but if a 3+ code-point entity were added in the future, the parser would malfunction.

Consider adding a brief comment to document this assumption, or verify upstream (in scripts/write-encode-map.ts) that the trie depth is bounded.

📝 Suggested documentation
 while (data.charCodeAt(cursor) !== 125 /* '}' */) {
     childKey += readGap() + 1;
+    // NB: Children are assumed to be leaf nodes (no nested {...} blocks).
+    // This holds for all HTML5 entities (max 2 code points per entity).
     next.set(childKey, readEntity());
 }
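
The upstream check suggested here could take a shape like the following sketch. It assumes the generator holds an in-memory trie before serializing; `GeneratorTrieNode`, `trieDepth`, and `assertMaxTrieDepth` are hypothetical names and do not reflect the actual structure of scripts/write-encode-map.ts.

```typescript
// Hypothetical in-memory node shape used by the generator (assumption).
interface GeneratorTrieNode {
    value?: string;
    next?: Map<number, GeneratorTrieNode>;
}

/** Number of levels in the trie; an empty map has depth 0. */
function trieDepth(nodes: Map<number, GeneratorTrieNode>): number {
    let deepest = 0;
    nodes.forEach((node) => {
        const childDepth = node.next ? trieDepth(node.next) : 0;
        deepest = Math.max(deepest, 1 + childDepth);
    });
    return deepest;
}

/** Throws if the trie is deeper than a flat child parser can handle. */
function assertMaxTrieDepth(
    nodes: Map<number, GeneratorTrieNode>,
    maxDepth: number,
): void {
    const depth = trieDepth(nodes);
    if (depth > maxDepth) {
        throw new Error(
            `Encode trie depth ${depth} exceeds the supported maximum of ${maxDepth}`,
        );
    }
}
```

Calling `assertMaxTrieDepth(trie, 2)` before serialization would turn a future 3+ code-point entity into a build-time error instead of a silent parser malfunction at runtime.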

-    if (out === undefined) return input;
+    // If nothing needed encoding, return the original string (avoids allocation).
+    if (out == null) return input;
     if (last < length) out += input.substr(last);

🧹 Nitpick | 🔵 Trivial

Prefer substring over deprecated substr.

String.prototype.substr is specified in ECMAScript Annex B (legacy features) and is not recommended for new code. Both methods behave identically when called with a single non-negative argument, as with the index last here; they differ only for negative arguments.

♻️ Suggested fix
-    if (last < length) out += input.substr(last);
+    if (last < length) out += input.substring(last);
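
A quick way to see when the two methods agree is to compare them on a sample string. The variable names below are illustrative only.

```typescript
const sampleInput = "entities";

// With a single non-negative index, all three return the same tail.
const viaSubstr = sampleInput.substr(3); // legacy (Annex B)
const viaSubstring = sampleInput.substring(3);
const viaSlice = sampleInput.slice(3);

// With a negative index they diverge: substr and slice count from the
// end of the string, while substring clamps the index to 0.
const negativeSubstr = sampleInput.substr(-3);
const negativeSlice = sampleInput.slice(-3);
const negativeSubstring = sampleInput.substring(-3);
```

Since the index passed in the encoder is a position reached while scanning the input, the replacement should be behavior-preserving; `slice` would work equally well under the same assumption.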
