refactor(encode): optimize trie, pick shortest entities, use base-10 numerics by fb55 · Pull Request #2198 · fb55/entities

fb55 · 2026-03-29T11:28:45Z

Summary

Pick shortest entity on encode: the encode-trie generator now selects the shortest named entity when multiple aliases exist for the same code point (e.g. &amp; over &AMP;)
Base-10 numeric entities: switches fallback numeric encoding from &#xHEX; to &#DEC;, giving shorter output for most code points (e.g. &#10084; vs &#x2764;)
Trie overhaul: replaces the imported htmlTrie Map from encode-html.ts with a compact serialized string format parsed at startup into a Map<number, EncodeTrieNode> + a flat asciiEntities[] lookup table for fast ASCII paths
Deleted src/internal/encode-shared.ts: all encode trie logic now lives in encode.ts
Multi-code-point entities preserved: entities like &nvlt; (< + U+20D2), &nvgt; (> + U+20D2), and &bne; (= + U+20E5) are kept in the trie and the encoder checks for multi-char matches before the ASCII fast path
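
A minimal sketch of the shortest-alias rule, assuming a hypothetical list of name/code-point pairs. The function name `pickShortest` and the tie-breaking rule (first name seen wins) are assumptions of this sketch; the real generator in scripts/write-encode-map.ts may work differently.

```typescript
// Hypothetical input: entity names (without "&" and ";") paired with a code point.
type Alias = readonly [string, number];

// Keep the shortest name per code point. On a tie the first name seen is kept;
// that tie-breaking rule is an assumption, not taken from the PR.
function pickShortest(aliases: readonly Alias[]): Map<number, string> {
    const best = new Map<number, string>();
    for (const [name, codePoint] of aliases) {
        const current = best.get(codePoint);
        if (current === undefined || name.length < current.length) {
            best.set(codePoint, name);
        }
    }
    return best;
}

const shortest = pickShortest([
    ["NonBreakingSpace", 0xa0],
    ["nbsp", 0xa0],
    ["leq", 0x2264],
    ["le", 0x2264],
]);
// shortest.get(0xa0) is "nbsp"; shortest.get(0x2264) is "le"
```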

Breaking changes

Numeric entity output changes from hex to decimal: &#x2764; becomes &#10084;. Both are valid HTML/XML, but snapshot tests matching exact output will need updating.
Named entity output may change when a shorter alias exists, e.g. if both &AMP; and &amp; map to the same code point, the shorter one is now always chosen.
The escapeUTF8 / encodeXML functions in escape.ts also produce decimal numeric entities (&#...; instead of &#x...;).
src/internal/encode-shared.ts is removed.
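
For anyone updating snapshots, the two numeric forms can be compared with a small sketch like this one. `hexReference` and `decimalReference` are hypothetical helper names for illustration, not functions exported by the library.

```typescript
// Format a code point as a hexadecimal numeric character reference (old output).
function hexReference(codePoint: number): string {
    return `&#x${codePoint.toString(16)};`;
}

// Format a code point as a decimal numeric character reference (new output).
function decimalReference(codePoint: number): string {
    return `&#${codePoint};`;
}

const heart = 0x2764; // U+2764 HEAVY BLACK HEART
const hexForm = hexReference(heart); // "&#x2764;"
const decimalForm = decimalReference(heart); // "&#10084;"
```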

Benchmark results

Geometric mean across 12 text variants (short/medium/long × low/high entity density × named/numeric), via html-entity-benchmarks compare:

	Geo mean (ops/s)	Δ
entities v8 (published)	541.5K	—
encode-perf	582.9K	+8%

Biggest gain on short-low-named inputs (2.94M → 7.65M ops/s, about 2.6×) from the ASCII fast-path array lookup. Longer/higher-density inputs are roughly at parity.

Bundle size (esbuild --minify --tree-shaking, esm)

Export	Before	After	Δ
`encodeHTML`	14,455 B (6,242 gz)	11,150 B (5,328 gz)	−23% (−15% gz)
Full bundle	51,718 B (28,091 gz)	48,605 B (27,320 gz)	−6% (−3% gz)

🤖 Generated with Claude Opus (Claude Code)

Summary by CodeRabbit

Refactor
- Numeric character references now use decimal form instead of hexadecimal.
Bug Fixes
- Improved handling of full Unicode (including astral/surrogate pairs) and multi-code-point sequences so emoji and combined characters encode correctly.
Tests
- Updated and added tests to expect decimal references and to cover multi-code-point/edge-case sequences.

Copilot

Pull request overview

Refactors HTML/XML encoding to use a more compact generated encode trie representation, adds an ASCII fast-path, and switches numeric entity output from hex (&#x...;) to decimal (&#...;) across encoding/escaping APIs.

Changes:

Replaces generated htmlTrie: Map with a default-exported serialized string parsed at startup, plus an asciiEntities[] lookup table for faster ASCII encoding.
Changes numeric entity fallback output from hexadecimal to decimal in encodeHTML/encodeNonAsciiHTML and encodeXML (and updates affected tests).
Removes src/internal/encode-shared.ts by inlining the trie parsing logic into src/encode.ts and updating the generator script.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
src/internal/encode-shared.ts	Deleted; trie parsing logic moved into `src/encode.ts`.
src/encode.ts	Adds runtime parsing of serialized trie + ASCII fast-path; switches numeric refs to decimal.
src/generated/encode-html.ts	Now exports a compact serialized string instead of a parsed `Map`.
scripts/write-encode-map.ts	Updates generator to serialize in new format; picks shortest entity name; strips ASCII children.
src/escape.ts	Switches `encodeXML` numeric output to decimal.
src/index.spec.ts	Updates expectations for decimal numeric entities.
src/encode.spec.ts	Updates expectations for decimal numeric entities.
src/decode.spec.ts	Formatting-only change (type annotation wrapping).

Copilot · 2026-03-29T11:33:51Z

src/encode.ts

+        const entityValue =
+            data.charCodeAt(cursor) === 123 /* '{' */ ? null : readEntity();
+
+        if (data.charCodeAt(cursor) === 123 /* '{' */) {
+            cursor++; // Skip '{'


The parser only writes to asciiEntities[...] when an ASCII entry has no children. If you ever decide to keep ASCII starter children in the serialized trie (instead of stripping them), the encoder would still go through the ASCII fast-path and would miss those values unless asciiEntities is populated for branch nodes too. Consider either enforcing "no ASCII children" as an invariant (and documenting it), or always populating asciiEntities when lastKey < 0x80 and an entityValue is present.
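
The concern above can be illustrated with a small sketch in which the encoder checks for a two-code-point match before falling back to the ASCII table. The table contents, the `asciiChildren` map and the `encodeAsciiSketch` function are illustrative assumptions, not the code in src/encode.ts; non-ASCII input is passed through unchanged here to keep the example short.

```typescript
// Illustrative data only; the real tables are generated from the entity list.
const asciiEntities: (string | undefined)[] = [];
asciiEntities[0x3c] = "&lt;";
asciiEntities[0x3d] = "&equals;";
asciiEntities[0x3e] = "&gt;";

// ASCII starters that also begin a two-code-point entity, keyed by the
// second code point.
const asciiChildren = new Map<number, Map<number, string>>([
    [0x3c, new Map<number, string>([[0x20d2, "&nvlt;"]])],
    [0x3d, new Map<number, string>([[0x20e5, "&bne;"]])],
    [0x3e, new Map<number, string>([[0x20d2, "&nvgt;"]])],
]);

function encodeAsciiSketch(input: string): string {
    let out = "";
    for (let index = 0; index < input.length; index++) {
        const code = input.charCodeAt(index);
        // Multi-code-point check first, so "<" + U+20D2 is not split into "&lt;".
        // charCodeAt past the end returns NaN, which never matches a key.
        const multi = asciiChildren.get(code)?.get(input.charCodeAt(index + 1));
        if (multi !== undefined) {
            out += multi;
            index++; // Consume the second code unit as well.
            continue;
        }
        // ASCII fast path: plain array lookup, falling back to the raw character.
        out += asciiEntities[code] ?? input.charAt(index);
    }
    return out;
}
```

With this ordering, `encodeAsciiSketch("<\u20d2")` yields `&nvlt;` while `encodeAsciiSketch("a<b")` yields `a&lt;b`.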

…dd regression tests Keep ASCII trie children (e.g. < + U+20D2 → &nvlt;, = + U+20E5 → &bne;, > + U+20D2 → &nvgt;) in the serialized data. The encoder now checks the trie for multi-char matches before falling back to the ASCII fast-path.

- Restore JSDoc comments on TrieNode interface fields (write-encode-map.ts)
- Rename `ai` → `astralIndex` for clarity
- Restructure encoder: ASCII fast-path first, trie lookup only for multi-code-point children fallback
- Remove unnecessary bare block wrapper

coderabbitai · 2026-03-30T09:28:02Z

📝 Walkthrough

Walkthrough

Runtime now builds the HTML encode trie from a serialized JSON string instead of importing a pre-parsed trie; numeric character references changed from hexadecimal to decimal across encoding/escaping; encode-shared was removed; tests and the trie-generation script were updated accordingly.

Changes

Cohort / File(s)	Summary
Trie generation script `scripts/write-encode-map.ts`	Now emits a JSON-serialized `encode-html.ts` (base-10 numeric format) instead of embedding `parseEncodeTrie(...)` and `encode-shared` references; logging updated to include data length.
Runtime trie parsing & encoder `src/encode.ts`	Removed external trie import; added in-module `EncodeTrieNode` type, runtime parsing of `htmlTrieData` into `Map<number, EncodeTrieNode>`, `asciiEntities` table, and multi-code-point matching; switched numeric fallbacks from hex to decimal and switched to `codePointAt` with surrogate handling; lazy `out` allocation adjusted.
Removed shared utility `src/internal/encode-shared.ts`	Deleted: previously exported `EncodeTrieNode` type and `parseEncodeTrie(serialized)` implementation removed.
XML escaping `src/escape.ts`	Changed XML numeric references from hex (`&#x...;`) to decimal (`&#...;`); uses `codePointAt` for code points and preserves surrogate handling.
Tests — encoding/escaping expectations `src/encode.spec.ts`, `src/index.spec.ts`	Updated assertions to expect decimal numeric references instead of hexadecimal; added tests for ASCII + combining/multi-code-point sequences; emoji and surrogate-related expectations adjusted.
Tests — minor formatting `src/decode.spec.ts`	Type annotation formatting for a test-local `callback` variable changed (no runtime behavior change).
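
The `codePointAt` and surrogate handling mentioned in the table can be sketched roughly as follows. `encodeNonAsciiDecimal` is a hypothetical function written for this illustration; it emits only decimal numeric references (no named entities) and does not treat lone surrogates specially.

```typescript
// Replace every non-ASCII code point with a decimal numeric character reference.
function encodeNonAsciiDecimal(input: string): string {
    let out = "";
    for (let index = 0; index < input.length; index++) {
        // codePointAt combines a valid surrogate pair into one astral code point.
        const codePoint = input.codePointAt(index)!;
        if (codePoint < 0x80) {
            out += input.charAt(index);
            continue;
        }
        out += `&#${codePoint};`;
        // Astral code points occupy two UTF-16 code units; skip the low surrogate.
        if (codePoint > 0xffff) index++;
    }
    return out;
}

// U+1F600 written as a surrogate pair, and U+00FC.
const emojiEncoded = encodeNonAsciiDecimal("a\uD83D\uDE00b"); // "a&#128512;b"
const umlautEncoded = encodeNonAsciiDecimal("\u00fc"); // "&#252;"
```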

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I nibble hex and hop to ten,
Trie unrolled at runtime again,
Shared file shelved, entities renewed,
Decimal signals in every queued.
Hooray — small hops, big changes, chew!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the three main objectives: trie optimization, selecting shortest entities, and switching to base-10 numerics.

fb55 · 2026-03-30T09:31:12Z

@coderabbitai review

coderabbitai · 2026-03-30T09:31:17Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/encode.ts (1)

158-159: ⚠️ Potential issue | 🟡 Minor

JSDoc still references hexadecimal format.

The JSDoc comments for encodeHTML (line 158-159) and encodeNonAsciiHTML (line 170-171) still say "numeric hexadecimal reference (eg. `&#xfc;`)" but the implementation now emits decimal references.

📝 Proposed fix to update JSDoc

 /**
  * Encodes all characters in the input using HTML entities. This includes
  * characters that are valid ASCII characters in HTML documents, such as `#`.
  *
  * To get a more compact output, consider using the `encodeNonAsciiHTML`
  * function, which will only encode characters that are not valid in HTML
  * documents, as well as non-ASCII characters.
  *
- * If a character has no equivalent entity, a numeric hexadecimal reference
- * (eg. `&#xfc;`) will be used.
+ * If a character has no equivalent entity, a numeric decimal reference
+ * (eg. `&#252;`) will be used.
  * `@param` input Input string to encode or decode.
  */
 export function encodeHTML(input: string): string {

 /**
  * Encodes all non-ASCII characters, as well as characters not valid in HTML
  * documents using HTML entities. This function will not encode characters that
  * are valid in HTML documents, such as `#`.
  *
- * If a character has no equivalent entity, a numeric hexadecimal reference
- * (eg. `&#xfc;`) will be used.
+ * If a character has no equivalent entity, a numeric decimal reference
+ * (eg. `&#252;`) will be used.
  * `@param` input Input string to encode or decode.
  */
 export function encodeNonAsciiHTML(input: string): string {

Also applies to: 170-171

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/encode.ts` around lines 158 - 159, update the JSDoc for encodeHTML and
encodeNonAsciiHTML to reflect that the implementation emits numeric decimal
references (e.g. `&#252;`, the decimal form of `&#xfc;`) instead of hexadecimal.
Locate the JSDoc blocks above the encodeHTML and encodeNonAsciiHTML function
declarations and change the phrase "numeric hexadecimal reference (eg.
`&#xfc;`)" to something like "numeric decimal reference (eg. `&#252;`)" or simply
"numeric character reference (decimal)". Ensure both encodeHTML and
encodeNonAsciiHTML comments are updated consistently.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d9108289-fe74-464a-b161-e7ee09bd6288

📥 Commits

Reviewing files that changed from the base of the PR and between 1dad19d and 43b47f0.

⛔ Files ignored due to path filters (1)

src/generated/encode-html.ts is excluded by !**/generated/**

📒 Files selected for processing (7)

scripts/write-encode-map.ts
src/decode.spec.ts
src/encode.spec.ts
src/encode.ts
src/escape.ts
src/index.spec.ts
src/internal/encode-shared.ts

💤 Files with no reviewable changes (1)

src/internal/encode-shared.ts

coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2278f83-9025-4933-90f8-a71966747584

📥 Commits

Reviewing files that changed from the base of the PR and between 43b47f0 and d67be1f.

📒 Files selected for processing (1)

src/encode.ts

coderabbitai · 2026-03-30T10:41:01Z

src/encode.ts

+            while (data.charCodeAt(cursor) !== 125 /* '}' */) {
+                childKey += readGap() + 1;
+                next.set(childKey, readEntity());
+            }


🧹 Nitpick | 🔵 Trivial

Child parsing assumes flat structure (depth-1 children only).

The child parsing loop reads only leaf entities and doesn't recursively parse nested {...} blocks. This works correctly for current HTML5 entities (all multi-code-point entities have exactly 2 code points), but if a 3+ code-point entity were added in the future, the parser would malfunction.

Consider adding a brief comment to document this assumption, or verify upstream (in scripts/write-encode-map.ts) that the trie depth is bounded.

📝 Suggested documentation

 while (data.charCodeAt(cursor) !== 125 /* '}' */) {
     childKey += readGap() + 1;
+    // NB: Children are assumed to be leaf nodes (no nested {...} blocks).
+    // This holds for all HTML5 entities (max 2 code points per entity).
     next.set(childKey, readEntity());
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/encode.ts` around lines 88 - 91, the child-parsing loop in readEntity (which uses readGap, childKey and next) assumes children are flat (only leaf entities) and doesn't recurse into nested `{...}` blocks. Address this by (A) adding a clear comment at the readEntity loop documenting the bounded-depth assumption (every multi-code-point entity has exactly 2 code points, so trie depth ≤ 2), and (B) adding a validation in scripts/write-encode-map.ts that the generated encode trie never exceeds depth 2 (fail the build or throw a clear error if deeper nodes are produced), so the assumption is enforced upstream. Mention readEntity, readGap, childKey and next in the comment, and have the validation check the trie depth when constructing/writing the map.
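
The upstream depth check suggested here could look roughly like the sketch below. The `SketchNode` shape and the `trieDepth` / `assertMaxDepth` helpers are assumptions made for illustration; the generator's actual node type and build flow are not shown in this thread.

```typescript
// Hypothetical in-memory node shape used before serialization.
interface SketchNode {
    value?: string;
    next?: Map<number, SketchNode>;
}

// Number of levels in the trie: 0 for an empty map, 1 for leaf-only entries.
function trieDepth(trie: Map<number, SketchNode>): number {
    let depth = 0;
    trie.forEach((node) => {
        const below = node.next ? trieDepth(node.next) : 0;
        depth = Math.max(depth, 1 + below);
    });
    return depth;
}

// Throw if the trie is deeper than the flat-children parser can handle.
function assertMaxDepth(trie: Map<number, SketchNode>, maxDepth = 2): void {
    const depth = trieDepth(trie);
    if (depth > maxDepth) {
        throw new Error(`Encode trie depth ${depth} exceeds supported depth ${maxDepth}`);
    }
}

// Two levels ("<" followed by U+20D2) pass the check.
const twoLevels = new Map<number, SketchNode>([
    [0x3c, { value: "&lt;", next: new Map<number, SketchNode>([[0x20d2, { value: "&nvlt;" }]]) }],
]);
assertMaxDepth(twoLevels);
```

Failing at generation time keeps the runtime parser simple while still catching a future entity with three or more code points.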

coderabbitai · 2026-03-30T10:41:01Z

src/encode.ts

-    if (out === undefined) return input;
+    // If nothing needed encoding, return the original string (avoids allocation).
+    if (out == null) return input;
    if (last < length) out += input.substr(last);


🧹 Nitpick | 🔵 Trivial

Prefer substring over deprecated substr.

String.prototype.substr is specified in ECMAScript Annex B (legacy features) and is not recommended for new code. When called with a single non-negative argument, as here, both methods return the same result; they differ only for negative arguments.

♻️ Suggested fix

- if (last < length) out += input.substr(last);
+ if (last < length) out += input.substring(last);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/encode.ts` at line 245, Replace the deprecated call to input.substr with input.substring to avoid relying on Annex B legacy features; locate the expression using input.substr(last) (the variables input, last, length, out are nearby) and change it so the remaining tail of the string is appended using input.substring(last) while preserving the existing behavior and assignment to out.
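
A quick check of how the two methods compare, as a sketch with an arbitrary sample string. With a non-negative start index they agree, which is the only case that matters for `input.substr(last)`; they diverge for negative arguments.

```typescript
const sample = "entities";

// Non-negative start: both return the tail of the string from that index.
const tailSubstr = sample.substr(3); // "ities"
const tailSubstring = sample.substring(3); // "ities"

// Negative start: substr counts from the end, substring clamps to 0.
const negativeSubstr = sample.substr(-2); // "es"
const negativeSubstring = sample.substring(-2); // "entities"
```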

refactor(encode): pick shortest entities, use base-10 numeric entities, optimize trie

f463618

Copilot AI review requested due to automatic review settings March 29, 2026 11:28

Copilot started reviewing on behalf of fb55 March 29, 2026 11:29

Copilot AI reviewed Mar 29, 2026

fb55 added 3 commits March 29, 2026 12:40

docs(encode): fix stale hex references in JSDoc, clarify ASCII fast-path

2d6501e

coderabbitai bot reviewed Mar 30, 2026

docs: update JSDoc to reference decimal numeric entities

d67be1f

The implementation was changed from hex (`&#xfc;`) to decimal (`&#252;`) but the JSDoc comments still referenced hexadecimal.

coderabbitai bot requested changes Mar 30, 2026
